[00:00:05] >> Well, thanks very much for the opportunity to give a short talk here. Today my talk is about SGD. I'm trying to understand first-order algorithms for nonconvex optimization in machine learning, and basically it is about stochastic gradient descent. [00:00:24] OK, so let's start. My talk is a collection of my joint work with my students, my colleagues, my friends at Princeton, and also people in China. I want to give special credit to this guy. He is the first author of the paper behind the second part of this talk, and when he finished the project he was only a third-year undergraduate. [00:01:04] And this guy is really amazing. OK, so let's start with the background. In this talk I will mainly talk about nonconvex optimization problems in machine learning, and I will talk about two things. The first thing is saddle points and the second thing is local optima. [00:01:34] I will show some new insights into nonconvex optimization in machine learning that help develop better algorithms, and I am also trying to connect research from different fields: applied probability, machine learning, optimization, and statistics. OK, so here is the motivating example of nonconvex optimization in machine learning. This is the most well-known example. [00:02:03] So, deep neural networks. Neural networks are not new; they were developed decades ago.
And what we were taught at that time was that training a neural network requires solving a highly nonconvex optimization problem with a very complicated landscape, which is supposed to be computationally intractable. And because the network is very large, eventually we would get overfitted results and suffer from the curse of dimensionality. This is what we were taught in textbooks. But things have changed now. These days we find that training a deep neural network is an optimization problem that is very complicated, but it has some kind of nice optimization landscape that is still beyond our understanding, and this nice landscape allows us to efficiently train neural networks with a very simple algorithm like stochastic gradient descent. [00:03:00] And as a result, you will find that the neural networks are still overfitting the data, but it turns out they generalize very well, which is also surprising. So the underlying theories are largely unknown; we don't really understand what deep neural networks are doing. [00:03:22] Training also relies heavily on hacks, and all those hacks make the stochastic gradient descent algorithm even more mysterious. So our question is: is stochastic gradient descent really working? There have been different perceptions of deep learning, and I think this reflects the cultural difference between the engineering community and the theory community. For practitioners, I would say that establishing [00:04:01] theory is very challenging, and some hacks may look very natural and necessary, so there is little I can say. Also, about what I will present in the later slides: first I want to say that what I present will have a gap between practice and theory. I'm not addressing these practical problems, because practical problems are still beyond our understanding. So I'm trying to start with simpler but nontrivial examples.
[00:04:37] You will see the examples. By studying these nontrivial examples, I'm trying to get a better understanding of optimization algorithms. The first algorithm I want to talk about is streaming principal component analysis. [00:05:01] It looks like a very simple optimization problem: we are minimizing a quadratic function subject to an orthogonality constraint. Because it is streaming, X here is a random vector, and we have X1, X2, X3, and so on, sampled from the same distribution with mean zero and covariance Sigma. This optimization problem is trying to find a bunch of leading eigenvectors. So why do we care about PCA? [00:05:32] PCA is actually one of the simplest nontrivial nonconvex optimization problems I can find in machine learning. It is known to be solvable in polynomial time, and it also has what we call the strict saddle property, which I will explain later. By studying streaming PCA we hope to get a better understanding of nonconvex optimization and of first-order algorithms. PCA has no spurious local optima. [00:06:10] So let's start with a little more generality. We are minimizing a function f, and a stationary point of f is a point where the gradient equals zero. Here I'm thinking about an unconstrained problem, so gradient equal to zero gives us a stationary point. A point is a spurious local optimum if it is optimal only within a small neighborhood but is not globally optimal. [00:06:38] A global optimum means that f(x) indeed achieves the global minimum. We also have strict saddle points: points where the gradient is zero but the Hessian has negative curvature, so you can escape from them. So these are just the definitions of
[00:07:05] global optima and strict saddle points. It turns out that PCA does not have spurious local optima: all the local optima are global optima. So this is a good property of PCA. For streaming PCA, we are more interested in understanding first-order algorithms like SGD, and why first-order algorithms are not trapped at [00:07:31] saddle points. The algorithm I'm using is called stochastic manifold gradient descent. Because we have the orthogonality constraint, we are actually doing optimization over the Stiefel manifold, and that is why I use this stochastic manifold gradient algorithm. This part is the stochastic gradient, and this part is the projection onto the tangent space of the manifold. A stationary point on the Stiefel manifold is defined by the manifold gradient being equal to zero, and negative curvature [00:08:05] on the manifold means that the minimum eigenvalue of the Hessian over the manifold is smaller than zero. OK. For those who are not familiar with manifold optimization: if we only care about the leading eigenvector in PCA, we reduce the problem to this [00:08:30] quadratic minimization over the sphere. Stochastic gradient on the sphere is just trying to find the descent direction on the sphere, and this part is more like a projection matrix, which removes the component that would take you off the sphere. [00:08:56] OK, then let me introduce a mild condition, the eigengap assumption. Here I'm only assuming that we are trying to find the leading eigenvector, so I assume that the largest eigenvalue of the covariance matrix is strictly greater than the second eigenvalue. Then the leading eigenvector of the covariance matrix Sigma is identifiable up to a sign change. This is the assumption we need [00:09:29] because we want the leading eigenvector to be unique. And if we study this a little more, we will find that
the stationary points are the eigenvectors of Sigma, each associated with an eigenvalue. We know that the leading eigenvector is the global minimum, [00:09:53] the eigenvector corresponding to the smallest eigenvalue gives us the global maximum, and all the other eigenvectors give us strict saddles, which means that at those points you have negative curvature. For higher rank, if we care about multiple-rank PCA, we essentially replace these with subspaces spanned by multiple eigenvectors; the result is similar, so I will skip the details. OK, now let me explain a little about the optimization behavior. Because this is a short talk, I will skip all the technical details. The proof is based on a diffusion approximation of stochastic gradient descent. This technique actually comes from the simulation optimization community, and it is a powerful technique that helps us understand the dynamics of the algorithm. Basically, [00:10:43] we let the step size of the stochastic gradient algorithm go to zero, and then the discrete-time solution trajectory converges weakly to the solution of an ordinary differential equation. By studying this ordinary differential equation we can understand the global dynamics. If you use a different scaling while you shrink the step size, the discrete-time trajectory converges weakly to the solution of a stochastic differential equation, and by studying this stochastic differential equation we can understand the local dynamics of the algorithm. [00:11:19] You will actually find that SGD for streaming PCA has three phases in convergence. The first one is escaping from saddle points. If your algorithm is, very unfortunately, initialized at a saddle point,
the dynamics of the algorithm can be characterized by an OU process with a drift pointing away from the saddle point, which helps you escape from the saddle point. The escape is actually started by the noise: the noise of the [00:12:00] stochastic gradient makes the solution unstable even though it is a stationary point, so it helps you escape from it. After escaping from the saddle point, the algorithm can still be characterized by an OU process, but [00:12:19] you will see that the drift is actually dominating the variance, so it evolves directly toward the global optimum, almost like a deterministic process, until it reaches a small neighborhood around the optimum. Here is an illustration. Eventually, when the solution is approaching the optimum, you will find that the algorithm behaves like an Ornstein-Uhlenbeck process with a gradually dying-out drift, so it is almost like an unbiased random walk centered at the optimum. This is how we use the diffusion approximation to understand the dynamics; it is more like continuous-time dynamics. Eventually we have this kind of asymptotic rate of convergence: for a sufficiently [00:13:10] small epsilon greater than zero, if we take this step size, we need at most this number of iterations to achieve an epsilon-optimal solution. This is a partial result, because we require that epsilon also goes to zero, so it is asymptotic rather than non-asymptotic in nature, but it gives us a more intuitive description of the algorithm
[00:13:41] escaping from saddle points and converging to the optimum. It can also be applied to more complicated algorithms, and that is the advantage of asymptotic analysis. Here are some numerical illustrations of the algorithm. The solution is initialized at a saddle point and eventually converges to the optimum. You can see that in phase one, this part is escaping from the saddle point, this part is traversing between stationary points, and eventually here we are converging to the optimum. [00:14:23] In both escaping from the saddle and converging to the optimum the algorithm tends to be very slow, but in between it is faster. OK, I want to briefly discuss this. Here I studied escaping from strict saddle points, where negative curvature exists, so the smallest eigenvalue of the Hessian matrix is smaller than zero. For degenerate saddle points, the theory is still largely unknown. We can also extend the result to momentum SGD [00:14:59] and see how momentum helps escape from saddles, and we can also use the asymptotic analysis to analyze momentum SGD for distributed nonconvex optimization in an asynchronous manner. And for those of you who are more interested in non-asymptotic analysis, we also have a discrete-time version, more like a discretization of the stochastic differential equation, and we show that the non-asymptotic rate of convergence matches the results we presented in the asymptotic analysis. The optimistic opinion on saddle points is that many people believe that strict saddle points are common, and that is a little optimistic. OK. [00:15:51] So that's the first part. I will briefly go through the second part in the remaining time.
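The streaming PCA update described in this first part can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the speaker's code: the function name `streaming_pca`, the diagonal covariance, the step size, and the iteration count are all hypothetical choices, and the problem is written as maximizing w^T Sigma w on the unit sphere (equivalent to minimizing its negative). The iterate is started exactly at a strict saddle (the second eigenvector) so that only the gradient noise can push it away.

```python
import numpy as np

def streaming_pca(sample, w0, step_size, num_steps):
    """Stochastic gradient ascent on the unit sphere for the leading eigenvector."""
    w = w0 / np.linalg.norm(w0)
    for _ in range(num_steps):
        x = sample()                  # one fresh data point per iteration
        grad = x * (x @ w)            # stochastic gradient of (w^T x)^2 / 2
        grad -= (w @ grad) * w        # project onto the tangent space of the sphere
        w = w + step_size * grad      # ascent step on w^T Sigma w
        w /= np.linalg.norm(w)        # retract back onto the sphere
    return w

# Hypothetical test problem: Sigma = diag(4, 2, 1), so the eigenvectors are
# the coordinate axes and the leading eigenvector is e1 (up to sign).
rng = np.random.default_rng(0)
scales = np.sqrt(np.array([4.0, 2.0, 1.0]))
sample = lambda: scales * rng.standard_normal(3)

w_saddle = np.array([0.0, 1.0, 0.0])  # second eigenvector: a strict saddle
w_final = streaming_pca(sample, w_saddle, step_size=0.002, num_steps=20000)
print("alignment with leading eigenvector:", abs(w_final[0]))
```

With a deterministic gradient the iterate would stay at the saddle forever, since the projected gradient is exactly zero there; in this sketch the cross terms of the random sample supply the initial push.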
OK, so the second example I'm talking about is a simple convolutional neural network. This convolutional network is extremely simple. Basically, we have this Gaussian input. [00:16:15] This w-star is a convolutional filter and this a-star is the weights of the second layer. The activation is just the ReLU activation, and here I impose that w-star has unit norm, because the ReLU activation is homogeneous, so to make this model identifiable I impose this sphere constraint. Here I assume the input is from a normal distribution with zero mean and identity covariance. [00:16:45] This one is what we call the teacher network, and w-star is called the true parameter. We solve this nonconvex optimization problem: basically we have another network, which we call the student network, and we minimize the loss function to train the student network. Eventually our goal is to make the student network converge to the teacher network. This optimization problem is actually more difficult than streaming PCA, in the sense that it has a nontrivial spurious local optimum. [00:17:20] So if you run gradient descent, gradient descent will be attracted to this spurious local optimum with constant probability, at least one fourth. OK, and here is the architecture. We call it non-overlapping because the convolution actually has no overlap. [00:17:45] This is a simplification to make the problem theoretically manageable. So this is the architecture, and here is the algorithm, which we call perturbed stochastic gradient descent. It is very similar to gradient descent, but we inject some independent noise into it. The injected noise is drawn from the unit sphere, and this nice form of the injected noise helps us in the analysis. I confess this is a bit artificial, because I don't know how to prove it for SGD.
[00:18:23] So I use this perturbed stochastic gradient descent, and here is the difference. I'm using something called noise annealing. When I inject noise, I'm using a noise level schedule: basically, I'm running a multi-epoch algorithm, and for the early epochs I'm using noise with larger variance, and I gradually decrease the variance, so that at later epochs the variance is small. [00:19:06] Perturbed stochastic gradient descent is different from SGD, because for SGD the variance of the stochastic gradient actually depends on the step size. Here, because our noise is injected, we decouple the step size and the noise, which also makes the theoretical analysis easier. [00:19:32] In SGD you would instead do step size annealing, using a decreasing step size, so it serves a similar purpose. For the algorithmic behavior, I want to provide some intuition: I think the algorithm is essentially doing some kind of convolution, [00:19:51] a smoothing of the optimization problem. Say this is the original landscape. If you inject small noise into the algorithm, then you are doing some smoothing, just not very strong. But if you use large noise, you can see that, compared with the original landscape, this landscape is much smoother. This is the intuition behind it, but it turns out that this is not really the case. Here I'm only giving an illustrative example, where it seems that after the convolution with the noise the landscape is nice enough, but it turns out the real case is very complicated. So here I want to introduce [00:20:40] a notion called the partial dissipative condition. The dissipative condition is somewhat similar to convexity, but different, as you can see here. This is basically the expectation of the perturbed
stochastic gradient, and the condition means that your update direction and [00:21:07] the direction toward the optimum form an angle that is smaller than pi over two. This one is for w and this one is for the output weight a. The difficulty is that both conditions do not hold globally, the two conditions might not hold at the same time, and the conditions can vary as the noise level varies. These are the challenges when we do the analysis. And this theoretical result tells us that [00:21:49] if we properly choose the learning rate and the noise level schedule, and use a polynomial mini-batch size, then we can guarantee that the algorithm achieves an epsilon-optimal solution in polynomial time. This bound is still very crude, but given the difficulty of the problem I'm so far satisfied with this result. Our analysis shows that there is basically a phase transition. At the early epochs, perturbed stochastic gradient descent is trying to avoid or escape the spurious local optimum while the noise is large. [00:22:31] After a certain epoch we reduce the noise to a smaller level, and then perturbed stochastic gradient descent starts to converge to the global optimum. So there exists a phase transition. Here are some numerical results. The curves are the success rates of converging to the global optimum for perturbed stochastic gradient descent, gradient descent, and stochastic gradient descent. You can see that perturbed stochastic gradient descent and stochastic gradient descent both have a 100 percent success rate. [00:23:03] But for gradient descent, you can see that in some difficult cases about 50 percent of the runs go to the spurious local optimum. Here are also some learning curves. The behavior of perturbed stochastic gradient descent is very similar to that of stochastic gradient descent, but we just don't know how to analyze the latter.
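The noise annealing idea can be tried on a toy problem. The sketch below is not the convolutional network from the talk and not the speaker's algorithm: it is a hypothetical one-dimensional function with a wide global minimum near 0 and a narrow spurious local minimum near 1.9, and it assumes that the injected noise perturbs the iterate before the gradient is evaluated (which matches the smoothing intuition above) and that the noise is uniform on an interval. The constants `A`, `C`, `S`, the step size, and the noise schedule are made-up values.

```python
import math
import random

A, C, S = 1.0, 2.0, 0.2   # depth, location, width of the narrow dip (toy values)

def grad_f(x):
    """Gradient of f(x) = x^2 / 2 - A * exp(-(x - C)^2 / (2 S^2))."""
    return x + A * (x - C) / S**2 * math.exp(-(x - C) ** 2 / (2 * S**2))

def gradient_descent(x, step_size, num_steps):
    for _ in range(num_steps):
        x -= step_size * grad_f(x)
    return x

def perturbed_descent(x, step_size, noise_levels, steps_per_epoch, rng):
    """Gradient steps evaluated at a randomly perturbed iterate, with the
    noise level lowered from one epoch to the next (noise annealing)."""
    for sigma in noise_levels:
        for _ in range(steps_per_epoch):
            xi = sigma * rng.uniform(-1.0, 1.0)   # injected noise, decoupled from the step size
            x -= step_size * grad_f(x + xi)
    return x

rng = random.Random(0)
x_gd = gradient_descent(2.0, step_size=0.01, num_steps=6000)
x_pgd = perturbed_descent(2.0, 0.01, noise_levels=[1.0, 0.3, 0.05],
                          steps_per_epoch=2000, rng=rng)
print("gradient descent ends at", round(x_gd, 3))
print("perturbed descent ends at", round(x_pgd, 3))
```

Both runs start inside the narrow dip. Plain gradient descent settles in the spurious minimum, while the large-noise epoch averages the dip away so the iterate drifts to the wide basin, and the later small-noise epochs let it settle there.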
[00:23:26] OK, just some discussion. Step size annealing in practice has a very similar effect on controlling the noise, similar to our noise annealing. And this is something I want to highlight: in our analysis we find that perturbed stochastic gradient descent behaves differently when training different layers. It seems that at the early stage, perturbed stochastic gradient descent first trains the output layer and then trains the convolutional layer, so it has a kind of priority. Also, for neural networks in general there exist many bad local optima, which by empirical observation cannot generalize, but we don't know whether escaping [00:24:17] from those bad local optima happens for the same reason I present here, so this needs a bit more investigation. To the best of our knowledge, this is the first theoretical result justifying the effect of noise in training neural networks by stochastic gradient type algorithms. Basically we are justifying the benefit of noise in nonconvex optimization, and I don't think there are many results in the optimization literature trying to justify the benefit of noise. OK. [00:24:56] So, a summary. There are still many open problems in optimization, and I would like to see more people start working on this area, since it is really exciting. OK, and these are the references. Thank you.