Thank you very much, Crystal, for this very nice introduction. Can you hear me well? It's a real honor to be distinguished by this lecture, and I'm very happy to be here. As always, this kind of lecture is a little bit intimidating for the speaker, which means in particular I'm not completely sure what I should really say here. So of course I have slides, but I've decided to do as much as I can on the board, and at some point maybe I will switch to slides, because the board gives me the possibility to adapt. So, even though nobody ever does this: if you think I'm too slow, too fast, whatever, just tell me. I want to talk about something which for me is a nice personal story as much as a mathematics story, and it's this question of the complexity of random functions. If you want to take only one thing from this talk, it's the following sentence. Of course a vague sentence is always a bit of a bait and switch, so we'll see if I do a bait and switch, but the sentence is something like this: smooth functions of many variables tend to be, or can be, very complex. Can you read blue and black and white here? This is vague, and there are many things we can discuss about that; the lecture will be exactly that. First, I'm saying smooth functions: for everybody, that means functions which have as many derivatives as you want. I'm not talking about Brownian motion here, complicated functions of that type. If you're a mathematician and want to think of one such function, think of a polynomial — think of a homogeneous polynomial, of degree three, say. I'm not talking about crazy things here. So: smooth functions of many variables. I have to explain how much is "many". So, just an example.
For the applications I have in mind, in deep learning, "many" is like ten to the six, seven, eight. And "tend to be" — I have to explain what that means, and of course what "complex" means. All that I will do. But the first thing is: why do we care? It's an interesting phenomenon that is really universal, in some instances mathematically provable, in others yet to be proven. But why do we care about functions of very many variables? What this talk will be about is to say that functions of very many variables are complex, then to say why this is important, and also to classify the types of complexity that appear. I imagine everybody here understands what a smooth function is and what complexity could be, but let's start with a function of one variable. Here is the graph of a function of one variable — the x-axis is here, the y-axis there — and here is another graph of a function of one variable. One is not many, but I cannot draw in ten-to-the-six variables, so here is another graph. Both functions are smooth; obviously the second one is complex and the first one is not. So why is this one more complex? One way to look at it: it has plenty of local minima and local maxima, and the other one doesn't. This one has many critical points. Let me make sure everybody is with me here. Take a function f of very many variables — or any number of variables; n will be large. We all know what the gradient is and what the Hessian is, but let me make sure. The gradient of f at a point x is simply the collection of the partial derivatives of f at x. That's the gradient, and the other object is of course the Hessian,
Which I will denote like this: the Hessian is just the matrix made of all the second partial derivatives of f at x. We learn that in Calc 3, multivariable calculus, so I hope we all know that. A critical point is simply a place where the gradient is zero: a point is critical means the gradient of f is zero there. And when you have a critical point, you may want to know whether it's a local minimum, a local maximum, or something else, and for that you look at the Hessian. It's a real symmetric matrix, so you can look at its eigenvalues, and at the index. Let me write it here: the index of a critical point is simply the number of negative eigenvalues of the Hessian. Just to make sure everybody is with me on that: at a critical point you have directions that go up and directions that go down, and the index is the number of directions that go down. A local minimum, for instance, is a place where the index is zero. In dimension one, a typical critical point can only be either a local minimum, with index zero, or a local maximum, with index one: one direction goes down, or one direction goes up. So, what do I mean here by functions of many variables? First, the domain will be the space of parameters; if you want to think simply, think R^n. But in fact the natural spaces on which we will be working are what geometers call manifolds — manifolds of dimension n, on which you also have notions of gradient and Hessian: a Riemannian manifold. So we know what a smooth function of many variables is. I still have to say what "tend to be" means, but what does "complex" mean? For me, complex will mean: it has not just many, but very many critical points.
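To make these definitions concrete, here is a small numerical sketch (my own illustration, not part of the lecture): a finite-difference gradient and Hessian, and the index of a critical point computed as the number of negative Hessian eigenvalues.

```python
import numpy as np

def gradient(f, x, h=1e-5):
    """Central-difference approximation of the gradient of f at x."""
    g = np.zeros(len(x))
    for i in range(len(x)):
        e = np.zeros(len(x)); e[i] = h
        g[i] = (f(x + e) - f(x - e)) / (2 * h)
    return g

def hessian(f, x, h=1e-4):
    """Central-difference approximation of the Hessian of f at x."""
    n = len(x)
    H = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            ei = np.zeros(n); ei[i] = h
            ej = np.zeros(n); ej[j] = h
            H[i, j] = (f(x + ei + ej) - f(x + ei - ej)
                       - f(x - ei + ej) + f(x - ei - ej)) / (4 * h * h)
    return H

def index_of_critical_point(f, x):
    """Number of negative eigenvalues of the Hessian of f at the critical point x."""
    return int(np.sum(np.linalg.eigvalsh(hessian(f, x)) < 0))

# The saddle f(x, y) = x^2 - y^2: one direction down, one up => index 1.
f = lambda x: x[0] ** 2 - x[1] ** 2
print(index_of_critical_point(f, np.array([0.0, 0.0])))  # 1
```

A local minimum such as x² + y² would return index 0, a local maximum in two variables index 2.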
And if you want, you could also say very many local minima or local maxima, like this function. What do I mean by very many here? Essentially, exponential in the dimension. So when you have ten to the eight variables, like in the deep learning problem, typically you have ten to the ten-to-the-eight critical points — that's a one followed by a hundred million zeros. That's complex. It's this kind of wiggling, but the number of wiggles is ten to the ten-to-the-eight. All right, so that's the message — except of course, as it stands, it's wrong: take the function of n variables x₁ cubed. This one doesn't have many critical points. On the sphere, x₁ cubed has only two critical points, so it's not exactly complex. So "tend to be" — I have to explain what that means. It will mean the following; now I can specify the statement a little. It will mean that if you pick a typical smooth function, it will be like this. But what does typical mean? For a probabilist it will mean a random function. So if you pick a function at random — and of course I have to explain how you pick random functions — it tends to be extremely complex, to have an exponentially large number of critical points. And again, you don't have to pick very bizarre functions. So now I can amend the statement, and it's good that I changed colors so you can see how I change it: smooth random functions are very complex, with high probability. That's the kind of statement I want to make. If you pick a function of very many variables at random, it will be very complex. All right, so at least the statement is kind of clear, but before going further I still have to answer the question: why do we care? Where do we need smooth random functions?
I can answer this in many ways. One would be for the mathematicians in the room: it's a very natural question if you know Morse theory. Morse theory tells you how the geometry and topology of the manifold constrain the number of critical points — that's exactly what Morse theory says. But this gives bounds. And what I'm saying here is that in large dimension, even on trivial manifolds like the sphere — there is nothing simpler than the sphere, and the bounds given by Morse theory there are very simple — typical random functions are extremely complex; the landscape will look like that, or worse. So it's natural to study Morse theory in high dimension, and there are very interesting phenomena appearing there. That's for the mathematicians. For the non-mathematicians: why do we care? If you're a geometer, you ask yourself very naturally: I have this landscape defined by a function, and I want to understand its shape. But now, if you are a little more practical than being interested in the beauty of the landscape: functions of many variables are everywhere, but the question is, why would they be random? This started with questions coming from physics. Where do you have a function of very many variables in physics? All over, but in particular in statistical physics. The variables are maybe the positions, the momenta, the spins — whatever describes your configuration — and f is typically the Hamiltonian, the energy of your configuration. If you think of the Ising model, there's no reason for the landscape to be like that; it's not a random function. But if you think of a model of a system which is, say, amorphous, or a glass, or something
Disordered — a configuration of atoms, of spheres in a box that you shake — it is very natural to imagine that the interactions are random. And there is a field of physics that has done that for a while, in particular for magnetism: magnetic glasses, spin glasses — and that is called spin glass theory. There is a fantastic story there, going back to Sherrington and Kirkpatrick and even earlier, and there is a picture emerging that these Hamiltonian landscapes, these energy landscapes, are very complex, going back to Parisi and his school. They have introduced an enormous amount of concepts about what this complexity is, different classes of complexity — what is called the replica symmetry breaking scheme of Parisi. Proving what is called the Parisi formula took more than thirty years for mathematicians; it has been done by Talagrand, but we are still very far from putting all the results of statistical physics on rigorous ground. The picture emerging, though, is that these Hamiltonians are very complex. And this is where I come from in this story. So imagine this drawing in very large dimension, and this is your energy landscape. First, what is the difference in complexity between this and that? Let's look at the level sets: look at the places where the function takes a given value L — this is my level set. Here it is very simple: it's those two points. And if you look at the sublevel set, the places where the function is below this level, it's just as simple: it's this whole interval. Whereas here, if you draw it, you see that the level set is way more complex, and the sublevel set too: it's a union of many, many intervals.
That's the kind of structure that matters. Now, if you think in physics, you think in terms of the Gibbs measure — that would be the microcanonical measure; or think of the canonical measure. You can imagine that its behavior would be very different here or there. That's the statics, the equilibrium questions of physics. If you think in terms of dynamics, it's also very different. Now forget physics and think in terms of optimization: you have a function of very many variables and you want to optimize it. That's a universal question in engineering. You'd rather have this one, right? Because here you just follow the gradient — not to say that convex optimization is not a hard and interesting question, but at least you have a hope of ending at the bottom. Here, if you try to optimize and find the absolute minimum, which is here, what kind of algorithm would you devise? It can be difficult. If you just follow the gradient and start here, you're good; but if you start here, you end up there, and if you have ten to the ten-to-the-eight such traps, it's hard to imagine that your optimization process would do well. Of course there are things to defeat this, for instance you just add noise. If you add noise, when you are at the bottom of a well, the noise will get you out at some point, and at some point you will end up down there. But "some point" can be very, very long. We will see that for the kind of landscapes I'm describing, the time to find the absolute minimum can be ten to the ten-to-the-eight, and if this is your algorithm, you don't really want that. So what is the next thing I want to talk about here? It's something that came up recently for me, which is deep learning. If artificial intelligence is so successful right now, it is because of the completely unreasonable success of deep learning.
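The noise trick mentioned just above can be seen already in one dimension. Below is a toy sketch (the tilted double well and every parameter are my own choices, not from the lecture): plain gradient descent started in the shallow well stays trapped, while the same descent plus Gaussian noise — a discretized Langevin dynamics — eventually hops over the barrier.

```python
import numpy as np

# Tilted double well: shallow local minimum near x = +1, deeper one near x = -1.
f  = lambda x: (x ** 2 - 1) ** 2 + 0.3 * x
df = lambda x: 4 * x * (x ** 2 - 1) + 0.3

def gradient_descent(x, eta=0.01, steps=5000):
    for _ in range(steps):
        x -= eta * df(x)
    return x

def langevin(x, eta=0.01, temp=0.5, steps=20000, seed=0):
    """Gradient descent plus Gaussian noise of size sqrt(2 * temp * eta)."""
    rng = np.random.default_rng(seed)
    traj = np.empty(steps)
    for t in range(steps):
        x = x - eta * df(x) + np.sqrt(2 * temp * eta) * rng.standard_normal()
        traj[t] = x
    return traj

x_gd = gradient_descent(1.0)   # stays trapped near the shallow well, x ~ 0.96
traj = langevin(1.0)
print(traj.min() < -0.5)       # True: the noise carried it into the deeper well
```

The price, as the lecture points out, is the waiting time: the expected hopping time grows exponentially with the barrier height divided by the temperature.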
Deep learning works fantastically well — I will say what it is. It's a discipline where the results are so good, but the proven results are nil: we cannot prove any result saying that it will work well. Nevertheless, when n is very large, there is something very interesting happening, which is what I want to describe. So let me maybe just explain the two examples I have in mind — I could have others. One is spin glasses, that's statistical physics; more generally it's disordered media, disordered statistical-physics-type things. And the second one is computer science: machine learning, deep learning. For the first one we can prove results; for the second one we can only look at them; but the phenomenon is of the same kind. All right, let me explain a little more. Maybe before getting into the mathematics of this, let me add a little personal touch, which I think is a nice story about what mathematics is supposed to do. I had been working on spin glasses and the dynamics of these strange objects for a long, long time. And in fact I was also working on another tool, which is called random matrix theory, and it took something like fifteen years before I suddenly realized that these two things put together would solve the question of complexity. That happens all the time in mathematics: you work on something, and on something else, and you don't even realize that in fact you're doing the same thing; then you put them together and things go well. But that's not the story. The real story is about being a vice provost and all these kinds of things — and, as was said today, it's on purpose that I decided not to be there for that, because you never know: if some kind of total disaster happens on the last day, you'd better be away. So what happened there?
Nothing to do with science: like everybody else, but maybe a bit earlier, something like six or seven years ago, we built a center for data science — I've talked to your vice president for research here, and you're doing the same. In order to build this at NYU — it's not easy: you have to work hard at these things, to make all sorts of branches of the university work together, find resources for real estate, find resources for positions, design new degrees, and so on. In order to do that you need a leader, and my leader for this was my colleague, who is a rock star of this type of thing; he got a grant, and so he was the leader, and we worked hard on those things for quite a number of months and years. At some point, really tired of it at the end of a long day, I told him what I should have started with: OK, tell me what you are really doing. So he told me about deep learning, what he was really doing — and his scientific adventure is really fantastic; he had spent a long time just before that essentially proving that deep learning was working for certain tasks. And just talking to him long enough, I realized: gosh, it could be the same problem — just a crazy guess, a crazy prophecy — it could be the same problem, and the same kind of thing could happen. He too has random functions of many variables, and maybe the same kind of complexity could be happening there too. Since then we have started a collaboration — he has since moved to Facebook, where he leads their research in New York — and everything we have predicted by analogy from statistical physics, without really proving anything yet, is kind of interesting on the other side, and I will describe that. So in some sense, what happened there was that mathematics was right in the middle.
What mathematics did here, between physics and computer science, was to abstract the problem: it is a question of geometry, and once it's a question of geometry, you look at it, you forget the specifics of the problem, and then maybe the conclusions should be the same. In fact, what this allows is to export many of the lessons we've learned from, say, the Italian school of spin glasses — Sherrington–Kirkpatrick, Thouless–Anderson–Palmer, Parisi — to the new context of machine learning. So that's the personal touch; now let's go back to more specific math. Is my goal understood? All right — is this on now? Good. So forget all this nice introduction that was supposed to whet your appetite. As any good teacher knows, you now go back and start with something simple. Forget all the motivations; let's think of one of the simplest functions of many variables you can think of. I said maybe the simplest of all would be a polynomial, so that's what I will look at — and let's make it homogeneous. So what is my polynomial? It is a function of n variables: a sum of monomials, say of degree p — p is my degree. I sum over all p-tuples i₁, …, i_p from 1 to n, and I need coefficients for my polynomial, so let me call them J: f(x) = Σ J_{i₁…i_p} x_{i₁} ⋯ x_{i_p}. The spin glass aficionados recognize what this is, but for now it's just a polynomial with coefficients J. And instead of looking at it on R^n, I can look at it with the variable x restricted to the sphere.
So x is a point on the sphere; that's my variable. Now, how do I choose this to be random? The simplest way is to choose those coefficients to be i.i.d. — independent, identically distributed — and, because that's always the simplest case, Gaussian. So I choose a random polynomial by choosing the coefficients at random: the simplest model of that. I take a random polynomial of degree p, and to simplify things let's take p = 3: a random cubic. Hard to make things simpler than that. You could say: take a random polynomial of degree one or degree two. Of course that would not work. On the sphere, if you take a polynomial of degree one, say x₁, you have only two critical points: the South Pole and the North Pole. That's not very complex. If you take something of degree two, it's a quadratic form; you can diagonalize it, and then generically you have 2n critical points. That's a lot if n is ten to the eight, but as soon as you go to degree three, instead of having two, or 2n, you have exponentially many in n — and it's the same for all higher degrees. So let's look at that a little more precisely, because for those who've never thought about it, this is a really strange phenomenon. First, some references: there are articles of mine with my former students Auffinger and Černý; there is recent work by Eliran Subag, who was a student of Ofer Zeitouni and has spent a lot of time at Courant; and then there are works in physics.
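Here is a sketch of the model just defined (sizes and seed are arbitrary choices of mine): i.i.d. standard Gaussian coefficients J_{ijk}, with x on the unit sphere. At any fixed point the value is a standard Gaussian — its variance is (Σ x_i²)³ = 1 — so typical values are of order one, while the minimum over the whole sphere is of order −√N.

```python
import numpy as np

rng = np.random.default_rng(0)
N = 30                                # small enough that the N^3 tensor fits in memory
J = rng.standard_normal((N, N, N))    # i.i.d. Gaussian coefficients J_{ijk}

def H(x):
    """Random cubic H(x) = sum_{i,j,k} J_{ijk} x_i x_j x_k."""
    return np.einsum('ijk,i,j,k->', J, x, x, x)

def random_sphere_point(n):
    x = rng.standard_normal(n)
    return x / np.linalg.norm(x)      # restrict to the unit sphere

# Values at random points on the sphere: mean ~ 0, standard deviation ~ 1,
# nowhere near the minimum, which is of order -sqrt(N).
vals = np.array([H(random_sphere_point(N)) for _ in range(500)])
print(vals.std())   # close to 1
```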
I have been working with physicists — Giulio Biroli and Chiara Cammarota in particular — and, in machine learning, with the team of Yann LeCun, with Anna Choromanska, Mikael Henaff and Michaël Mathieu, and with another of my students, Levent Sagun. So here's the plan of the lecture; it's simple. First question: minimizing the random cubic. Then I go to the general question, then say just a little bit about the mathematical tools, then go to spin glasses, and then to computer science, if time permits. The simple question is what I've been describing. Consider a homogeneous polynomial of degree three, restricted to the unit sphere, and the question, for instance, is: how easy is it to find its minimum? That seems like a simple question — finding the minimum of a polynomial — but you now understand from what I said that this is essentially impossible in short time, in finite time, in less than the age of the universe. Of course the answer depends a bit on the polynomial: if you take x₁ cubed, that's a polynomial of degree three, and its minimum is not difficult. So instead you take an f of less special, more general shape, at random: the polynomial of degree three whose coefficients are Gaussian. What is its minimum? The question is: is it easy to minimize f with your preferred algorithm — gradient descent, stochastic gradient descent, Langevin dynamics, whatever you like? Will the algorithm get to or near the minimum, or stay stuck above it, if you let it run for a decent time? In order to understand that, we need geometric information: how does the function look near its low values? Do we have many minima? Are the multiple wells separated by high barriers? Because you could have plenty of wells separated by tiny barriers, and then it's easy to move between them. What can you say about the shape?
So that's a geometric question: the shape of the level sets. Here are some answers, with the normalization I had here. Look back at the function: it is Gaussian, and it's centered because I chose the J's to be centered, so its mean value is zero; the typical value is zero. If you compute, you see that on the sphere, the absolute minimum — call it m_N — is of order minus the square root of N. More precisely — and you can prove this; this is a hard result — the minimum is negative: it is −E₀ times √N, where E₀ is about 1.657. I'll tell you how you prove that; the hard part is due to Subag. Now, if you try a minimization algorithm, it will get stuck at the level −E_∞ √N, where E_∞ is about 1.633. So the absolute minimum is at −1.657 √N, and the algorithm gets stuck at −1.633 √N. This is how I convinced Facebook that there was something there. They're very proud of their algorithms, and they tried them on this function — I predicted the three digits. After a long time they came back and said: you're wrong, it's stuck at 1.635. And then I was very relieved, because for reasons I will explain, if it had been stuck on the wrong side of this level it would have been a disaster for the theory, but like this it just meant the algorithm had not run long enough. They wanted to run it a little more, and of course they found 1.633. So, as I said: don't trust me, try it — and if you want to see the paper that does this, look at that. Now, you may think: am I really interested in the precise minimum? This function is some performance measure, some loss function, some utility function.
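In the spirit of "don't trust me, try it", here is a sketch of that experiment (the projected-gradient scheme and all parameters are my own choices; the levels −E₀ ≈ −1.657 and −E_∞ = −2√(2/3) ≈ −1.633 are asymptotic predictions, so a moderate N only lands in their vicinity):

```python
import numpy as np

rng = np.random.default_rng(1)
N = 60
J = rng.standard_normal((N, N, N))   # random cubic with i.i.d. Gaussian coefficients

def H(x):
    return np.einsum('ijk,i,j,k->', J, x, x, x)

def grad_H(x):
    # Gradient of sum J_ijk x_i x_j x_k (J is not symmetrized, hence three terms).
    return (np.einsum('ijk,j,k->i', J, x, x)
            + np.einsum('jik,j,k->i', J, x, x)
            + np.einsum('jki,j,k->i', J, x, x))

x = rng.standard_normal(N)
x /= np.linalg.norm(x)
for _ in range(2000):                # projected gradient descent on the unit sphere
    x = x - 0.005 * grad_H(x)
    x /= np.linalg.norm(x)

# Gets stuck at a local minimum: value per sqrt(N) around -1.6, near the
# sticking level -E_inf, above the true minimum near -E_0 * sqrt(N).
print(H(x) / np.sqrt(N))
```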
What am I losing by getting stuck here? I find this value, which is not that bad compared to the true one. Of course I could tweak the problem a little and the two values could be more different, but typically what it says is that by doing something naive you will get stuck at a level which is a definite fraction of the real thing. That's already good news: it's not an infinitesimal fraction — you could have been stopped at −N^(1/4) instead of order −√N. So you find the right order, but you don't find the right constant, and you cannot find the absolute minimum with a typical algorithm. For instance — this is a recent result of another of my PhD students — if you run Langevin dynamics for this type of thing, you can prove that the spectral gap is exponential in N, which means the time to reach the absolute minimum is exponential in N. It's not that you cannot find it; it just takes forever. So, as I said, a random cubic is a very complex function, and the number of local minima is exponentially large in N — these are facts. Now something more interesting, which I haven't mentioned yet, about the structure of those minima. We are on the sphere in dimension N, very large, and we have this crazy landscape. Here is a kind of graph to explain this. Of course it's hard to draw anything in very large dimension, but say this is the sphere, in dimension N − 1, and here I draw the graph of the energy, of our polynomial. We have some levels: zero is up there; a level that I call −E_∞ √N; and another level, −E₀ √N. And what happens is the following — forget the scale here: all the local minima are in this band between the two, which is already very good news.
You have this very complicated landscape, but at the typical value, zero, you don't find any local minima. So if you run any kind of algorithm, you will go down — and you will get stuck where, exactly? Here, where the local minima begin to appear. And how many local minima do you have? Exponentially many of them, and as soon as you find one, you get stuck there. But the absolute minimum is somewhere down here. You don't find local minima above this level; in fact, you don't find any critical points of finite index there. You do find many critical points above it, but their index diverges with N. Which means: when you find a critical point with a diverging, large index, it has very many directions to go down, so your optimization algorithm will find a way down. Whereas when you get here, you have finite index: if you have five directions to go down and you are in ten-to-the-eight dimensions, they are not easy to find. So you get stuck here, and the time to go from here to there is exponential. That's typical of what physicists call the 1-RSB phase, the one-step replica symmetry breaking phase. More interestingly — the last sentence here — in this band, as I said, you have the local minima but also the points of index one, the set of points of index one, of index two, and so on. But the local minima dominate exponentially, which means that if you take one of those critical points, it will be a local minimum with very high probability. That's very precise geometric information. Above the threshold, as I said, there are only critical points of diverging index: if you take a point at level zero, the typical value of this polynomial, the critical points you find there — and you find exponentially many of them — are all of index N/2. Half the directions go down, half the directions go up, which is pretty natural for a typical point.
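The index picture just described can be caricatured with random matrices (this is my own toy illustration, not the precise statement of the theory: roughly, at a critical point the Hessian behaves like a GOE matrix whose spectrum is shifted according to the energy level).

```python
import numpy as np

rng = np.random.default_rng(3)
n = 1000
A = rng.standard_normal((n, n))
M = (A + A.T) / (2 * np.sqrt(n))   # GOE-like matrix; spectrum fills about [-sqrt(2), sqrt(2)]
eigs = np.linalg.eigvalsh(M)

# At the typical level the spectrum is centered: about half the eigenvalues are
# negative, i.e. the index is about n/2 -- half the directions down, half up.
print(abs(np.mean(eigs < 0) - 0.5) < 0.05)   # True

# Shift the whole spectrum past its edge (the caricature of a level where no
# descending directions remain): no negative eigenvalues at all.
print(int(np.sum(eigs + 1.5 < 0)))           # 0
```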
And when you begin to go down, you have a third of the directions going down and two thirds going up, then a fourth and three fourths, until you get here, where you have, say, seventeen directions going down and ten-to-the-eight minus seventeen going up. All right. Below that threshold, more importantly, there are extensive barriers to cross: if you take two such local minima, to go from one to the other you have to climb a height of order a constant times √N, which means that the time it takes for any local algorithm will be terrible. And finally, for the mathematicians here, if you're interested in the shape: you can look at the sublevel set, the region on the sphere where the function is lower than the level, and try to understand its topology. In the region I'm describing, when the level is there, you have all those minima, and around them little balls, if you want, where the function is below the level, and the Euler characteristic is exponentially large in N. Of course, local algorithms essentially use only the information in the gradient and the Hessian — and in practice you will never use the Hessian, it's way too costly, so you just use the gradient, and sometimes not even that. So here is a theorem, which says the following. This quantity is the number of critical points of index k whose value is between −∞ and √N times u: you look at all the critical points below a level, and you fix the level and you fix the index. If you take k = 0, that corresponds to local minima. And we find that this is the exponential of N times a function which is explicit — don't read it, it's painful —
But the important thing is that it's positive. So this tells you that the mean number of local minima below a level — that's very detailed information — is exponentially large. The function in the exponent is very explicit, and with this formula you get essentially everything I said before. Well, not quite — let me explain one more thing. This is just the first moment method: I computed the mean, and that's not enough if you want to understand the behavior of the number of critical points. What Eliran Subag did last year, for exactly this model, is compute the second moment — believe me, painful — and he proved that the first moment computation gives the right answer, because the second moment is such that the random variable concentrates around its mean; so the mean carries the information. All right. Now let me draw a picture of this complexity function. If I put the energy level u on this axis — you have to switch from the previous picture; turn your head — then Θ₀, the complexity of the minima, for k = 0, looks like this. The function is explicit, so it's easy to draw the graph. What does it mean? Above this level, which is the absolute minimum — remember, everything is scaled by √N — the complexity is positive, so you have an exponentially large number of minima below level u. Below this level, the complexity is negative, which means you have an exponentially small number — essentially, you don't have any. And above a certain level the complexity stabilizes: this is the complexity of the number of minima up to level u, so it's natural that it increases, and this is compatible with the idea that above this level
Above that level you don't have any more minima: you've essentially saturated, found all your minima, and then it stops; that's Θ_0. Now Θ_1 (you don't want me to redraw everything) is the same here, and here it's like this. Remember, Θ_1 is the complexity of saddle points of index one. And what you see is that Θ_1 is strictly smaller than Θ_0 down here. This is what I was telling you: the number of saddle points of index one is exponentially smaller than the number of minima. And then Θ_2 is like this, and so on; these thresholds, which I call minus E_k, increase and converge to minus E_infinity. OK, so that's the theorem. So, as I just said, there are no critical points of finite index above this value, minus E_infinity, and the E_k's are just defined as the thresholds for positive complexity of critical points of index k. I will skip that now. Here I explained, vaguely, how we got the result about E_0; I'm saying that this number, minus E_0, is the absolute minimum. And that's not clear, right? It is clear that below minus E_0 the probability to find a minimum is exponentially small, but the absolute minimum could still be higher, because this is just about the mean. In fact it is there, and that is what the second moment allows you to prove. So, just in case you wonder about other degrees: all this is valid for any p at least three, but not for one or two, which I already discussed. So now, what about non-homogeneous polynomials? I have just described the pure case. So let's take a simple example, because here I drew one picture of complexity, what physicists call 1-RSB, the simple complexity, and you shouldn't think that this is the end of the story; in fact there is a typical, completely different form of complexity, which is called full RSB, full replica symmetry breaking. So let me explain what that is. Take the following simple example.
Without going into physics or machine learning: take a homogeneous polynomial of degree three and one of degree sixteen. OK, then you take a combination of the two, say with coefficients c and square root of one minus c squared, these numbers for reasons of normalization. So you take a combination of the degree three and the degree sixteen. What will happen? If c is close to one or to zero, this thing is close to being a homogeneous polynomial of degree three or of degree sixteen, so the lessons we learned over there should be true, and they are. In the middle, something else could happen, and it does: when c is not close to zero or one (there are explicit critical values, in fact), the situation is completely different. There is still an exponentially large number of minima, but the minima do not dominate exponentially like we saw before, and there are no extensive energy barriers. So if you think of it, this very complex thing is good news, very good news. In physics, when you study this, it is bad news, because it's very complicated, and it takes thirty years to begin to understand anything; we haven't yet fully understood the full replica symmetry breaking picture. But from a practical point of view it's very good news, because it means you can get down to the bottom without crossing barriers that would stop you forever. So this is a complex function, but one much easier to minimize than the other one. OK, so full replica symmetry breaking, for the physicists here, is good news. Here it is predicted that the minimum should be reached in a large but finite time. OK, so now, in this case, the minimum is not given by the threshold of the complexity; the minimum is somewhere here. So there is a whole region of energies where the mean number of critical points is exponentially large but where there are none with high probability, which means that the mean, the first moment, does not describe the situation properly. And the second moment doesn't either.
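The mixture just described can be written down explicitly (degrees three and sixteen as in the example; the critical values of c are explicit but not quoted here):

```latex
\[
f_c \;=\; c\, f_{3} \;+\; \sqrt{1-c^{2}}\; f_{16},
\]
```

where f_3 and f_16 are independent random homogeneous polynomials of degrees 3 and 16 with the same variance; the coefficients c and the square root of 1 minus c squared are exactly what keeps the variance of f_c independent of c, which is the "reason of normalization" mentioned above. For c near 0 or 1 the pure 1-RSB picture survives; for intermediate c the landscape is full-RSB.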
So that is of course bad news for the mathematician, because it means that you cannot settle it simply with moments. OK, now for the physics aficionados (I didn't think there would be one, but I know I have one): in fact, in this polynomial there is even a region which is 2-RSB. That's new; that's incredible, I didn't believe it. In fact we now know how to construct 3-RSB and 4-RSB. I thought it would never happen, but it does. But OK, that was a private conversation; let's forget it. Let's look at what the question is. How much more time do I have? Ten minutes, OK. So, what is the question in geometric terms? You consider a smooth function of many variables, defined on a compact manifold of large dimension, and you want to find its minimum. This may be a hard problem, and we measure how hard by counting critical points, as I said. So assume that every critical point is non-degenerate, that is, your function is a Morse function; then it has a finite number of critical points and you can begin to count them. I call Crt_k(F) the number of critical points of index k. And, as I said, Morse theory gives constraints on these numbers, for any Morse function. For instance on the sphere (the geometers here will say that this is essentially nothing), the weak Morse inequalities say there is at least one minimum and one maximum. Big news, for a function on a sphere! So the weak inequalities say nothing. The strong Morse inequalities say more; let me explain what they are. Here I call B_k the Betti numbers, which are topological quantities, easy to compute on the sphere for instance. The weak Morse inequalities just say that Crt_k is larger than B_k. The strong Morse inequalities say that if you take a Morse function and you take alternating sums of the numbers of critical points,
they are larger than the corresponding alternating sums of Betti numbers, which are purely topological: the right-hand side does not depend on the function, and any Morse function satisfies these bounds. OK. And in particular, if you take the full alternating sum, it's called the Euler characteristic, and there you have equality. So let's look at this. The Euler characteristic is just the alternating sum, and it's a topological invariant. How remarkable is that: if you take this function to be our random cubic polynomial, each of these terms is exponentially large in N, and yet the alternating sum is the Euler characteristic of the sphere, which is zero or two depending on the parity of the dimension. So you have N numbers of exponential size whose alternating sum is zero; there is something going on there, that cannot happen by accident. So that's what we have here. When I said that these bounds were valid for every Morse function: what we prove is that when you take a random Morse function, it is generically much more complex than the constraints given by the Morse inequalities require. So you can do that on the sphere and in a few other geometries, but not generally. So what is the general structure in which the theorem should hold? A natural structure is to imagine that you have a Riemannian manifold and a Gaussian process on it; for us that was the sphere, and the process was a random polynomial. If it's a Gaussian process (let's assume for the moment that it's centered, mean zero), the only thing you need to describe it is its covariance: you take two points of the manifold, you have to describe the covariance of the values at these two points, and you just assume that it's a function of the distance, which, maybe you didn't see it, is the case in the example. So then of course this function g has to be chosen properly; the variance, in this case, is constant, and so
the relevant metric is the natural metric induced by the Gaussian process: every Gaussian process defines a metric, and what this is saying is that the covariance is a function of that natural metric, so the two geometries are essentially equivalent. In this context I suspect that we will always have what I described, in reasonable cases, but we are far from being able to prove it; the few geometries where you can prove it are constant curvature and simple things. OK, I won't describe the sphere case further, because I have spent too much time; I just want to end with one formula. Why can we do all these computations? Because of this magic formula. So let's look at it. This is the mean number of critical points; in fact there are constants there, forget them. It is essentially the mean of a certain functional of lambda_k, where lambda_k is the k-th eigenvalue of a GOE matrix. That means the following: you take a random matrix of size N, real symmetric, with the entries independent Gaussians above the diagonal. Of course it has a spectrum, and we know many, many things about that spectrum. And this formula relates the question of critical points to a question about random matrices. Once you've done this you are happy, because then you can work on the random matrix side, where you can do a lot of things. So why is it true? Where is the random matrix in the model? In fact there is not one; there are many, many of them. If you take a critical point, its Hessian is a random matrix, right? Maybe not a GOE matrix a priori, but it is clearly a real symmetric Gaussian matrix. Now, if you condition the point to be critical, and if you condition on the value of the function, then you can compute the law of this Gaussian matrix, and it is a GOE, shifted essentially by the value.
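As a small numerical illustration of the random matrix side (this is my own sketch, not the speaker's computation): sampling a GOE matrix, real symmetric with independent Gaussian entries above the diagonal, and checking that its spectrum fills the interval from minus two root N to two root N, as the semicircle law predicts.

```python
import numpy as np

def sample_goe(n, rng):
    """Real symmetric matrix with independent N(0, 1) entries above the
    diagonal and N(0, 2) on it: the standard GOE normalization."""
    a = rng.standard_normal((n, n))
    return (a + a.T) / np.sqrt(2)

rng = np.random.default_rng(0)
n = 400
eigs = np.linalg.eigvalsh(sample_goe(n, rng))

# The spectrum concentrates on [-2*sqrt(n), 2*sqrt(n)] (semicircle law),
# so these ratios should be close to -2 and +2 respectively.
print(eigs.min() / np.sqrt(n), eigs.max() / np.sqrt(n))
```

The eigenvalue lambda_k in the magic formula is then the k-th smallest entry of `eigs`; the index of a critical point is the number of negative eigenvalues of its (conditioned, shifted) Hessian.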
So once you are there, you can do all this using another formula, well known in probability, which is called the Kac-Rice formula, and which relates the number of minima to the determinant of the Hessian. OK, so that's the trick. All right, I won't go deeper into that. But there is one thing that I haven't done, which is: how on earth is this related to machine learning? I just told you there is a random function in machine learning; where is it? And first, how is it related to physics? In fact, what I've done here is spin glasses. I didn't say it, but that's what it is: what I've described is called the pure p-spin spherical model. In the spherical case, p equal to two, the analogue of the Sherrington-Kirkpatrick model, is trivial: you don't want to analyze it, because it is a quadratic form on the sphere, an eigenvalue problem, nothing interesting. On the cube it's highly non-trivial, but on the sphere it's simple. So that's the physics. In fact, here I did physics at zero temperature: I just tried to understand the minimum. If you want to do physics, you want to understand the Gibbs measure, which means you study the measure with density the exponential of minus a constant times F; that measure is the Gibbs measure. This has been done: these results are in the paper of Subag that I was quoting, which essentially proves a decomposition of the Gibbs measure and shows that the Gibbs measure is essentially carried by the deepest wells, in a precise way. But I won't go there. If I have two more minutes, five more minutes, I can describe what the machine learning question is, but I can also stop here; that depends on the chair. OK, so let's do it very quickly. Where is the random function in machine learning?
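Before moving on: for the record, the Kac-Rice formula mentioned a moment ago reads, schematically (for a smooth Gaussian field f on the sphere, under the standard non-degeneracy assumptions):

```latex
\[
\mathbb{E}\bigl[\operatorname{Crt}_{N,k}(u)\bigr]
=\int_{S^{N-1}}
\mathbb{E}\Bigl[\bigl|\det \nabla^{2} f(x)\bigr|\,
\mathbf{1}\bigl\{\operatorname{index}\,\nabla^{2} f(x)=k,\ f(x)\le \sqrt{N}\,u\bigr\}
\Bigm|\ \nabla f(x)=0\Bigr]\,
p_{\nabla f(x)}(0)\,\mathrm{d}x,
\]
```

and for the pure p-spin model the conditional law of the Hessian, given that x is critical with a given value, is a shifted GOE, which is exactly what brings the GOE eigenvalues into the magic formula.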
So maybe first let me jump to the end, because I don't want to run over. Here is something said by people in machine learning, in fact by Yann LeCun; look at what it says. The random function is the loss function, and what he says is interesting: in the vast majority of practical applications of supervised learning with very deep networks, the loss function is minimized using some form of stochastic gradient descent. The general shape of this loss function is very poorly understood, and that's an understatement. Several researchers experimenting with stochastic gradient descent have noticed that while these loss functions do have many local minima (and many, as I said, we now know, is like ten to the ten to the eight), the results of multiple experiments consistently give very similar performance. That's surprising: if you have a zillion minima, why do you always end up at the same height? And I have explained that: it's because the energy level, the loss level, where the local minima begin to appear is always the same; there is a concentration of measure phenomenon, and so you end up there and nowhere else. Of course you can try much harder; you won't do much better, so you stop there. So where is this random function? Let me describe, very naively, a deep network. Let's imagine that your task is the following: recognize whether there is a cat in a picture from the internet. The sample at your disposal is essentially infinite: the number of pictures on the internet is whatever you want, and there are pictures with cats, lots of them. So what do you do? You build a multilayer network. You have nodes; here are the input nodes, many of them, very many of them. That's my N. And here you enter; let's say each node is a pixel of your image. OK, then you introduce,
somewhere out there, an output node, which will be, in the example I gave, a binary thing: yes or no, there is a cat or there is no cat. It could be something more quantitative, but let's say something very binary: there is a cat or there is none. Of course the cat story is silly, but if you think of trying to have a car driving itself, it's not silly, even if it's not cats you are recognizing. So that's an important question. So what do you do? You build a multilayer network: you put in a certain number of layers, say K layers. So let me draw another layer of nodes here, and another there, et cetera. And then you draw a graph. That's the art; it's like cooking, it's the art of the engineers. What distinguishes a good deep learner from a bad one is exactly this, and this is where mathematics could play a role, if we could devise rules to do it properly; for the moment there are none. So you do this, for instance: you have this layered graph, and the same thing at each level. OK, now you equip each edge of the graph with a weight, a number W. And then, what is the operation? You put in your input and you do something linear: if your inputs are the x_j, you form the sums over j of W_ij x_j. So you take all these values here, you multiply by the weights, and you get this value here. Simple. Then you do something which is not linear: you take a function of it, a function sigma. Typically sigma of u is u_+, the positive part, something like this, or a sigmoid function, whatever you want. So at each node here, after the first layer, I have a linear combination of the things down there with those weights; then I apply a little nonlinear transformation node by node, and then I do the same thing, et cetera, until I get here, and I get a value. And this value will be F, a function depending on all those weights and depending on my entries x, the x's being the input.
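The forward pass just described, a linear combination with the edge weights followed by the nonlinearity sigma(u) = u_+ at each node, can be sketched as follows (the layer sizes and random weights here are invented for illustration):

```python
import numpy as np

def relu(u):
    # sigma(u) = positive part of u
    return np.maximum(u, 0.0)

def forward(x, weights):
    """Apply K layers: at each layer, a linear combination with the
    edge weights W, then the nonlinearity node by node."""
    h = x
    for W in weights[:-1]:
        h = relu(W @ h)
    return weights[-1] @ h  # scalar output: cat / no-cat score

rng = np.random.default_rng(1)
sizes = [64, 32, 16, 1]            # input pixels -> hidden layers -> output
weights = [rng.standard_normal((m, n)) / np.sqrt(n)
           for n, m in zip(sizes[:-1], sizes[1:])]
x = rng.standard_normal(64)        # one "image"
score = forward(x, weights)
print(float(score))
```

The output is the value F(W, x): a score depending on all the weights W and on the input x.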
OK. And what do I do then? The w's are given, and the output tells me zero or one: there is a cat or there is no cat. Then I do supervised learning, which means I look at the answer and I say: yes, there is a cat, or no; you are wrong, or you are right. Then what do I do? I compute the mean error. I produce a loss of the following type: it's a sum over my sample, a sum over m from one to capital M, capital M being the size of my sample, the number of pictures I took from the internet to learn whether there was a cat, how long my training period was, how many times I told the network you're right or you're wrong. And I sum a certain function, whatever it is: G of the true value y_m and of the predicted value F(W, x_m), W being my set of all the weights. That's the nature of the function. And I look at this as a function of the weights. It is L, L like loss, a function of W, all my edge weights, and of course of the random sample. That is a function of very many variables, the variables being the w's: if you have very many nodes, that's where you get, say, ten to the six or ten to the seven edges. Why is that a random function, and why is it close to a Gaussian random function? Of course it is a random function: divide by M and apply the central limit theorem. Here you have a sum of functions of i.i.d. samples drawn from the distribution of the universe, a distribution which of course you don't know. So what is this? It is essentially the mean, let me call it the integral of G d nu(x, y), where nu is the distribution of what is happening in the universe (so that's a function you don't know), plus one over square root of M times a Gaussian term, call it G(W), which is the error. So now we are in the same situation as before.
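The central limit heuristic in the last paragraph can be checked numerically on a toy example (the "universe" distribution nu and the per-sample loss G below are invented; only the one over root M scaling matters): at a fixed weight W, the empirical loss is the population mean plus fluctuations of order one over the square root of M.

```python
import numpy as np

rng = np.random.default_rng(2)

def per_sample_loss(w, x):
    # G(y, F(w, x)) for a toy one-parameter model; purely illustrative.
    return (np.tanh(w * x) - np.sign(x)) ** 2

w = 0.7

def empirical_loss(M):
    x = rng.standard_normal(M)   # i.i.d. sample from "the universe"
    return per_sample_loss(w, x).mean()

# Fluctuations of the empirical loss around its mean shrink like 1/sqrt(M).
for M in (100, 10_000):
    runs = np.array([empirical_loss(M) for _ in range(200)])
    print(M, runs.std())
```

Increasing M by a factor of one hundred should shrink the standard deviation of the empirical loss by about a factor of ten: that is the Gaussian term of size one over root M in the decomposition above.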
The caveat is that our function now has a deterministic part, a function that I don't know but that I expect to be much smoother, plus a one over square root of M Gaussian term, which is now defined on the space of W's, which is very large. So what do they do? They try to minimize this loss function: minimize in W, adapt the weights until the error you make is the smallest possible. They do that by stochastic gradient descent; I won't even say what that is, some algorithm. But now the shape of this function is important. So this is close to what we were doing, with a slight difference: what we had before was essentially the Gaussian term only, a centered random function. Now we add this deterministic function, a nice, smooth function, not a complex one. What happens? This is also something we are studying (this is joint work), and here is what we find, in a nutshell: there is a transition, depending on the strength of this noise parameter. And of course this is natural: the more you learn, the larger M is and the weaker the noise. So if this parameter is small, then essentially minimizing this loss is like minimizing the deterministic term: you can ignore the noise, the signal is strong, and if the signal has a few minima you will find them easily. Now, if this parameter is large, which means you don't learn enough, M is too small, then the noise dominates, and the picture I have just described is valid, which means that you will find one of those minima anywhere, essentially anywhere on the sphere. But in between there is a transition. Of course you don't want to learn for too long; there is a regime where you keep the complexity, where you still have an exponentially large number of local minima, and this is where you get stuck, but those minima are not too far from the true minimum of the true function.
Which is much more reasonable than the former picture, because in the former picture, which was pure noise, you find a minimum at the level at which the performance saturates, and it is the same for all of them, but those minima are spread essentially uniformly over the sphere. Here that would mean that the weights, the synaptic strengths in your brain, are essentially chosen at random, which doesn't seem very reasonable: the way you wired your brain when you learned how to read and the way I did are maybe different, but probably not too far apart. In the transition phase you have exactly that: you have the phase where you have this complexity, but in fact the minima are essentially not too far from the true minimum. And that also explains it: you have the same phenomenon of concentration of the performance. OK. Everything I have said here is heuristic; I cannot yet do the analysis, the spectral random matrix theory needed for it. If anybody wants to try, please do. And I'm not saying this in jest: it's a serious, interesting problem, and maybe some people can do it faster than me. But I am completely convinced that this picture is correct. Now comes the question of what kind of complexity we will find, when we find one: will it be the complexity I described for the pure cubic, or the crazy thing that I didn't even really describe, which is called full RSB? If we find 1-RSB, then what it means is that, essentially, the performance you get is maybe seventy percent of the true optimal performance, and that is all you can get in finite time. So be it; there is no point in trying to do better, because it would take forever. If it's full RSB, it means you can in fact get to the absolute minimum in finite time, which is much better news. But now: in physics, the God of physics was giving you the model. Here, machine learning is engineering, it's not science: God here is the programmer, the computer science person who designed the network.
And so maybe, in fact, the good machine learners, like Monsieur Jourdain speaking prose without knowing it, are finding designs which are in a full replica symmetry breaking phase, and that is why they find the right minimum. This is a crazy guess, completely crazy, I have no clue, but it's a possibility. The other possibility is simply that whatever you do, everybody will find the same level as you, so there is no point in trying to do better. So those are the two possibilities. The proofs of all that will probably wait for a while, but, as I said, there are things to be proven there. Thank you. [Applause; question from the audience.] Yes. So of course this is a very natural question, which I have asked many times of my colleagues, but the problem is that when you work in this kind of dimension, you can do essentially nothing. For instance, I told them: to distinguish between 1-RSB and full RSB there would be a method, if you could follow the descent and look at the critical points you visit on your way down. Look at their Hessian; show me the Hessian and I can tell you whether it is full RSB or 1-RSB. But they can't even compute the Hessian, because it takes way too much time. So the algorithms they use are very, very simple: stochastic gradient descent, the simplest of all things; you have a sum of functions, you just follow the gradient of each term. So maybe I am not the person to answer that; I am not knowledgeable enough about the performance of global algorithms and their speed. But from what I understood, at this point nobody sees another, global algorithm that would be fast enough. And of course, once you have trained on those pictures and you ask "is there a cat?", it answers right away, and better than you or I sometimes do; it can even tell you where the cat or the dog is, here and there.
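The "simplest of all things" mentioned here, stochastic gradient descent on a finite sum, is literally this (a generic sketch, not tied to any particular network):

```python
import numpy as np

def sgd(grad_i, w0, n_terms, steps, lr, rng):
    """Minimize L(w) = (1/M) sum_i f_i(w) by following, at each step,
    the gradient of one randomly chosen term f_i."""
    w = np.array(w0, dtype=float)
    for _ in range(steps):
        i = rng.integers(n_terms)
        w -= lr * grad_i(i, w)
    return w

# Toy sum: f_i(w) = (w - a_i)^2, whose full minimizer is mean(a).
rng = np.random.default_rng(3)
a = rng.standard_normal(50)
w_star = sgd(lambda i, w: 2 * (w - a[i]), 0.0, len(a), 5_000, 0.01, rng)
print(float(w_star), a.mean())
```

Each step uses the gradient of a single randomly chosen term of the sum, never the full gradient and certainly never the Hessian; that is the whole algorithm.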
So this works very well, and very fast, and of course if you want something like a self-driving car, you want it not to think too much. So this is probably another possibility: once you have the geometry, you could think of global algorithms that would be fast, but that is more for an engineer: first think about it, and then think how to do it. Please. [Question from the audience.] So yes, it's a very important question, of course. That is what I was saying when I said it was like cooking: the art of the people designing the algorithms is that they have experience designing networks for different tasks, and they kind of know that it is better to do it like this or like that. So they navigate the space of models. And here, for a fixed model, the parameter space is pretty simple (typically, if you fix the graph, it is a product of Grassmannians, I mean, things amenable to analysis), but they can completely change the shape of the graph; that is what you are asking about. So for the moment, and that is precisely my point, this is done like a good chef does it: he knows what to do. Maybe we could move to a more professional way of doing it once we begin to understand; for the moment we understand not enough. And what my point about random functions says is that typically it is very hard (at least that is my belief, and on the sphere it is proven) to escape this complexity. The functions I described on the sphere essentially cover all the models that can be defined naturally on the sphere, and essentially you always have exponential complexity, except when the degree is one or two. So that is probably why they haven't found a design which escapes it. Again, I don't know whether there is a way; maybe some design would be fantastically efficient, but for the moment it is not there. There is a question there.
For the moment you can't. Exactly; that is the thing, at least for me. Maybe you can; sorry. But I can say that this is the first kind of mathematical tool with which to begin to unravel the story. As you probably know, there is a tension here between statistics and machine learning, because, you know, deep learning appeared a long time ago, in the nineties, and then it just didn't work. Why didn't it work? Because M was not very large, and this concentration of measure phenomenon, the one that makes the performance always come out at the same level, was not there. So if somebody tried an algorithm, and somebody else tried another run of the same algorithm, they would find completely different answers. That didn't really inspire trust, right? What do you do with such random things? So in the meantime, of course, statistics, and learning theory, which is close to it, developed very strong tools, kernel methods, all sorts of methods that work pretty well, and until recently that was the only way to do things. Then came these people with deep networks, and for maybe four or five years they were essentially demoing, just doing demonstrations (I am exaggerating, but maybe a few years), and the performance is fantastic. And the gain in performance: you know why Facebook came and hired our whole machine learning group, even though I tried hard to keep them? It was because the gain in performance was not the usual gain in performance, the kind you get when the mathematician comes around and improves things by one percent; it was twenty-five, thirty percent improvement. So at this point these methods are efficient, but they are completely mathematically unjustified: we have no idea why they work, except what I am
describing here, tentatively. And we don't even have an engineering approach to it, one which would say: OK, we cannot prove the theorems, but we understand the structure and we know what to do. At this point (and this is more what I feel, and maybe this is unfair, and my friend Yann, who was here with us, can yell at me) what I feel is that their knowledge is really built from the ground up, empirically, and there is no foundation yet. Once we do what you propose, things will be much nicer, because at some point these two things, this and statistics, will have to work together; in the end it has to be there, but for the moment it is very far from there. This is my understanding; again, remember, I am a probabilist, not an engineer. Yes: you are right, but M is also pretty large, so it depends; M can be much larger than the dimension, or it can be smaller, and that is exactly the transition I was explaining. You are perfectly right. And yes, it is possible that the Gaussian approximation is not correct. But do I believe in universality results? Yes I do. Yes I do, of course I do; that is why I do this. So, in spin glass theory, this has been proven for other quantities. Here I am doing spin glasses at temperature zero; at positive temperature, for questions like the free energy, this universality that you are asking about (what happens if the interactions, the coefficients, are not Gaussian but whatever else) has been proven. So the whole picture, at least in some contexts, is correct even when the G_ij, the coupling constants, are not Gaussian. That has been proven for lots of models, for the free energy. You need, maybe, a Poincaré inequality for the distribution, you need some conditions, but not extreme ones. For this exact question here: yes, yes, we don't understand it; nobody has said anything yet in that case.
So give us a break of a few years, and then Sourav Chatterjee will prove the universality. So, thank you.