My undergraduate degree and my Ph.D. are in physics, and my worst grade in physics (I won't tell you what it was) came in the experimental physics class. Experimental physics is a very different thing from theoretical physics, and, to oversimplify, I didn't do so well in that class because I basically failed the final. Roughly, the reason I failed the final is that there is a machine, there is a socket in the wall that powers that machine, and you are supposed to put the two together, and I didn't get that. So I tried to argue; I tried to say, well, I had everything else right. And the teacher said, basically: if you learn nothing else, you should know to check the input and check the output. So it's good to know that I haven't lost that skill yet.

OK, so I think the microphones are working. I'm leading a TRIPODS effort, which is foundations of data. Do you need to actually worry about what you plug into the wall? You do need to compute stuff; but for foundations, I guess the question is: should you be totally ignorant about that, or should you be doing the foundations in the context of applications and implementations and so on, so that it might be useful? That's one of a couple of ways to think about what I'll be talking about here today, and it should be sort of in the back of your mind.

So: second order machine learning. I still want some feedback on this. I'm not a marketing guy, I do math and foundations. Is this good marketing? Should I change the name? Does this sound like second class machine learning, you're not so good; or does this sound like higher order and super fancy machine learning? Give me feedback after the talk. The way to think of this is a Taylor expansion: there is a zeroth order, a first order, and a second order, and not just in an informal way; we're going to be doing fancier and more sophisticated types of machine learning that boil down to those sorts of optimization methods. And this, I think, taps right into the middle of the TRIPODS effort you heard about in the NSF call. Georgia Tech, we, and a bunch of other places are setting up institutes to address foundations of data, and the idea is that you want to take some stuff from computer science, some stuff from statistics, and some stuff from applied math, and distill out the parts that are most useful for data. That's actually a little bit tricky, because a lot of people in foundations try to formulate the theory in a way that is largely divorced from the data, and that's very different from, say, what scientific computing has done over the years, where they really want to do simulations and make very particular downstream claims.
Scientific computing and related areas think it's just obvious that you should be doing second order methods, and machine learning and other areas think it's just obvious that you should be doing first order methods. Whenever I see this sort of situation, where two different groups of people believe two diametrically opposed things, that tells me there is interesting space in between to move into, because both sides probably have something to offer, and both are missing something. So rather than getting into the weeds about this or that level of detail about which methods are better, keep in the back of your mind what's going on here: the phenomenon of big data, large data, and so on. If you're developing methods and foundations, what are the problems now that are different from what machine learners wanted to do, what scientific computing people wanted, and what drove applied math and the theory of algorithms in the past? That's what will be going on in the background here.

I don't have a pointer, which is probably OK because I have two things to point to. Machine learning is sort of an inverse problem, and inverse problems have a long history in applied math: you have some forward process and you want to back out the inverse process. I'll be talking about two things. First, some first order methods. First order means you roughly roll downhill; you step in some gradient descent direction, and I'll go into more detail about that. These oftentimes don't do particularly well, but they're very easy to implement and think about, and they're very easy to tack lots of knobs onto. Lots of knobs means you can easily fit lots of problems, and if you have lots of data, maybe you'll find a dataset where you do better. So the question is, in model space, is that a good way to be working? Clearly it may be, if you want to get people up and running quickly. I'll talk about some of those, and about a particular thing that combines knobs in an interesting way. Then you can say: I want to work with a narrower class of methods that is in some sense more expensive; the class of algorithms is more narrow, but it is richer and better within that narrower class. So I'll talk about second order methods. There is a naive way to implement them, and the naive way is not good, and then there is a range of non-naive things to do. If what you know about second order methods is that you have to invert a Hessian, that's not the case. That's analogous to saying that if you want to solve linear equations you need to compute an inverse. Page one of a linear algebra book says: to solve Ax = b, the solution is x = A^{-1} b. Pages two through five hundred say why you don't do that: there is a huge number of other ways to solve that problem without ever computing the inverse. Similarly here.
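To make that page-one versus pages-two-through-five-hundred point concrete, here is a minimal NumPy sketch (the matrix size is arbitrary): call a solver, which factors the matrix under the hood, rather than forming the inverse.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(500, 500))
b = rng.normal(size=500)

x_page1 = np.linalg.inv(A) @ b   # "page one": form A^{-1} explicitly, then multiply
x_rest = np.linalg.solve(A, b)   # "pages two through five hundred": factor and solve

# Both compute a solution of Ax = b, but solve() is cheaper and numerically
# more accurate, increasingly so as A becomes ill-conditioned.
print(np.linalg.norm(A @ x_page1 - b), np.linalg.norm(A @ x_rest - b))
```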
Right, so big data. The elephant in the room is big data, massive data. We started MMDS, the workshop on Modern Massive Data Sets, I think in two thousand six, and a couple of years later it became big data and then data science and so on. So there is a marketing and terminology question, but the elephant in the room is that every unit, every department in this university thinks data science has something to do with them: sociologists and economists and historians, not to mention scientists and engineers and so on. This is a very big societal trend; people are having to wrestle with it, and it instantiates itself in different ways. And so the TRIPODS thing that was alluded to asks: how can we pull foundations and theory out of that? Genetics and brains and a range of other areas generate lots of data; now we're going to try to take it to the next level, so you need humongous data. Again, feedback on the marketing, but: humongous data.

So how do we view this? In my experience, if you ask people what's going on, what is data science, what is machine learning, people get very dogmatic, and there is often a territorial issue; but it's a classic example of groping at the elephant and seeing what you're trained to see. The scientific computing person says: well, I need a bigger machine, or a distributed data system is what's needed. The statistician says: gee, I have to posit a model. The machine learner says: I'll tack on regularization, and I'll want to iterate; or all three, or all four of those. The algorithms person says: wow, it's big, therefore I need fast algorithms. And recently the data scientists just say: gee, it's a mess, I need to clean this mess up and everything else is easy. To get one step beyond this: what I'll be talking about today is first and second order optimization methods, and I'll say why; but keep in the back of your mind that all of these groups are coming at this question of how to optimize and how to compute things, and the question is why, and whether we can get better methods and better foundations and so on.

A lot of people think about these things in different ways, and I think the leading axis of variance, the leading order bit in how people think about it, can be characterized as follows. This slide is about five or six years old, and things are evolving, so maybe it's slightly less true now than it was then, but still. You can think about these things from what, for lack of a better term, I'll call the computer science perspective, which is the perspective you probably adopted if you were trained in computer science, certainly up until ten years or so ago. There, the data are just a record of everything there is: all the clicks at Walmart,
all the sales from some store, and the goal is to find interesting patterns, the beer-and-diapers story, if you're familiar with that. Finding these patterns is needle-in-a-haystack sort of stuff; it's typically intractable, so you posit models of data access that are appropriate for the data and you design fast approximation algorithms. The form of those algorithms is: on this data sitting in front of me, I either compute the exact answer, or I'm provably good and provably faster than the exact computation would take. There is no world out there; this is the data, this is all there is. Maybe you're going to use this for that, but this is what there is.

There is a very different view of the data, one that I think essentially everyone else in the world adopts. It's formalized in statistics departments and so on, but others hold it too: the data are just a particular random instantiation of a noisy process, and the reason they're so useful is that I want to learn something about the world. The data are just a middleman, but they're the only insight I have into the world; so the goal is not anything on the data, it's to draw insight about the world. To do that I chew on the data: I posit a model, or I do whatever, I optimize something, and I want to make a claim about the world. So these are two very, very different things.

In particular, at the foundations level: in statistics, the natural sciences, and scientific computing, the problems involve computation, but computation per se is, I think, secondary. On that view it only makes sense to think about a subset of problems, problems that are well posed. Well posed means a solution exists, it's unique, and it's robust to perturbations of the input. You might say: why would I compute anything else? And the answer is that algorithmic computer science computes on lots and lots of problems that are not well posed in that sense. So on that view it only makes sense to develop the problem first, write down a model, and think about computation later; and you may have parameters, like condition numbers, that characterize the well-posedness of the problem and enter into the analysis of the algorithm. Computer science is very different. Numerical analysis was around in the early days of computer science, but it was largely pushed out into the range of other computational-X departments, computational physics or chemistry or scientific computing; and in CS it's easier to study computation per se, if that's what you're after, in a discrete setting. You can hide solutions to halting problems in real numbers and funny stuff like that, so it makes sense to consider discrete problems.
There you're talking about Turing machines and complexity classes and so on; but in particular, efficient algorithms are statements about functional transformations, not about data, and you don't restrict to only well-posed problems. You're making statements about these transformations, so you really divorce the discussion of efficiency from data per se, and that manifests itself in how questions get posed: first find a fast algorithm, and only later see if it makes sense. So that's the general level.

Before I get into these two problems, you may say: if I care about doing a better job predicting puppy dogs on the Internet, or finding quasars, or curing cancer, why is any of this relevant? As you know, in machine learning a lot of problems are formulated as optimization problems, and this is convenient because you write down, implicitly or explicitly, an inferential task, a claim about the world, and for a subset of those you can essentially reduce the task to an optimization problem. An example: you write down the usual thing from VC theory and reduce it to an optimization problem, and all the details about the problem enter via one parameter, say. That reduction is convenient, but if you look under the hood, and you have more than just a black-box optimization problem, oftentimes you can make better downstream claims. So we're going to talk about the optimization question today, and I'll describe some theory in that context. Keep in the back of your mind (I have a few slides at the end about why this might matter more generally) that I think this is a nice way into these foundational questions: theory-of-algorithms people, applied mathematicians, scientific computing people, a range of people across the methodological spectrum are asking the same questions, but in light of the very different forcing functions of modern large-scale data.

So we're going to have two types of problems. One is what we'll call composite optimization: minimize a function that can be written as F plus H, where F is nice and H is fairly nice but might not be smooth. The canonical example is L1 regularization: it might be smooth here, but it's not smooth everywhere. Or H could be the indicator of a constraint set, which is not smooth right at the boundary. The second is a finite-sum problem: a sum of functions, each of which is sort of nice. In the universe of optimization this is a small subset, but it is a subset of real interest: in the first, F can be a model and H a capacity-controlling regularizer, a ridge or L1 penalty or whatever; the second is the canonical setup of empirical risk minimization, where I have a loss function and my overall objective is a sum over all the data points of that loss. So that's the second problem.

A lot of classical optimization, if you go to optimization or to scientific computing, is, let's say, effective but inefficient. So there are going to be two questions here, for what will be iterative algorithms, not one-shot solvers; the typical method is iterative. The first question is going to be: what's the cost per iteration?
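To fix notation, here is a minimal way to write down the two problem classes just described; the symbols are generic rather than taken from the slides.

```latex
% Problem 1: composite optimization, smooth part F plus non-smooth part H
\min_{x \in \mathbb{R}^d} \; F(x) + H(x),
\qquad \text{e.g. } H(x) = \lambda \|x\|_1,
\text{ or } H = \text{the indicator of a constraint set.}

% Problem 2: finite-sum minimization (empirical risk minimization)
\min_{x \in \mathbb{R}^d} \; \frac{1}{n} \sum_{i=1}^{n} f_i(x),
\qquad f_i = \text{the loss on the } i\text{-th data point.}
```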
And the second question will be: how many iterations do I need? We're going to do two things here, and we're going to deal with both questions.

In some sense, and I mentioned scientific computing, if you go one step beyond vanilla stochastic gradient descent, beyond methods where you want to be completely agnostic about where the data came from or you have very weak knowledge about the domain, and you want to do better in a fairly narrow sense, you need to go beyond very low precision computation. Randomized linear algebra gets low precision approximations unless you couple it with an iterative method, the conjugate gradient method for example, so you need to go beyond that; and that's been studied for years in a different context, so there is a rich area there. We were talking about this ten years ago, and about five years ago a killer application arrived that made it very explicit: deep neural networks and deep learning. There you had machine learning algorithms that were numerically intensive, not data intensive. Before that, the general consensus was that the hard part is getting the data in, and then you run something and the thing will work. That's not quite true, but it was the general consensus, and here was a nice killer application showing it to be nakedly false. If you want to compete in that area on the applied side, you need to do better at predicting puppy dogs or whatever the particular CIFAR-10 task is; but if you want to use it as a forcing function, you can say: they are at a very different point in the foundational space, and can you pull out what's interesting going on there? That's what we're talking about.

So I want to run an optimization algorithm. Some of the things we'll talk about are convex, some are not; I'll have more to say about that later. If you're going to run an iterative optimization algorithm, you typically want to minimize something. And I should say, if you come at this from discrete optimization, the way those problems are sometimes solved, there are certain ones with combinatorial, flow-like algorithms that exploit that structure. I'm thinking of this more generally, although in recent years some of those are implicitly running these sorts of algorithms under the hood too. The simplest thing you can think of, if I want to minimize something, is to roll downhill, and the way you instantiate that is: I have a function, I approximate it with a local linear approximation, I get a gradient, and I roll downhill. That's gradient descent. It's very easy to do, and each step is very quick, but you may need a lot, a lot, a lot of steps. And since you may need so many steps, you ask: can I improve it? There is a dizzying array of ways to improve it: you can average the last two steps, you can take this, you can take that, you can combine things. If I do stochastic gradient descent, I don't roll downhill, I get a stochastic estimate of the downhill direction, and now there's my batch size, my learning rate; there are just dozens of knobs here, and so you get all sorts of variants. That's the state of the art of first order methods in machine learning.
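As a concrete anchor for "roll downhill," here is a minimal sketch of gradient descent and its stochastic variant on a least-squares objective. The function names, the fixed step size, and the toy data are illustrative choices, not anything from the talk:

```python
import numpy as np

def gradient_descent(grad, x0, lr=0.1, steps=1000):
    """Plain gradient descent: repeatedly step in the negative gradient direction."""
    x = x0
    for _ in range(steps):
        x = x - lr * grad(x)
    return x

def sgd(grad_i, n, x0, lr=0.1, steps=1000, rng=np.random.default_rng(0)):
    """Stochastic gradient descent: use the gradient of one random term per step.
    Two knobs (lr, steps) already appear, and real variants add many more:
    batch size, schedules, momentum, averaging, and so on."""
    x = x0
    for _ in range(steps):
        i = rng.integers(n)
        x = x - lr * grad_i(x, i)
    return x

# Least-squares example: F(x) = (1/2n) * ||A x - b||^2
A = np.random.default_rng(1).normal(size=(100, 5))
b = A @ np.ones(5)
full_grad = lambda x: A.T @ (A @ x - b) / len(b)
one_grad = lambda x, i: A[i] * (A[i] @ x - b[i])

x_gd = gradient_descent(full_grad, np.zeros(5))
x_sgd = sgd(one_grad, len(b), np.zeros(5))
```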
So, what's going on with first order methods? The model you should have in mind is the picture on the bottom. You can think of this thing as a circle stretched out in some funny norm; that's fine, but you don't know what that norm is, so think of it as an ellipse, not axis-aligned, in some high-dimensional place. If you're at some point, the downhill direction is here, but the minimum is over there. When I say downhill I mean the steepest descent direction, the gradient; and the point is that you don't necessarily want to move in the gradient direction if you can make a much larger step in some other direction. Accelerated methods, if you know what those are, take two steps of this kind and try to get some terms to cancel, based on ideas involving a dual space and so on. That's the picture to keep in the back of your mind. So that's why people do this, and it avoids certain kinds of trouble.

Now, the challenges with these simple methods. You just roll downhill; I showed you the code snippet, and anyone who knows Python can put that in a few lines of code. It will always give you something, even if that something is nonsensical; it will work in the sense that, syntactically, it will work. Second order methods take a bit more understanding of what's going on, so there is a higher barrier to entry, but they are much richer in some senses. First order methods are very sensitive to conditioning: if the aspect ratio of that ellipse is not ideal, you've got very serious problems. Now, stated that way, some people will say: it's machine learning, all you need is low precision, why worry about conditioning issues, just stick in some regularization and it will all work. Sometimes there are funny things going on under the hood; and if you don't like that answer, look at the killer application. There is this phenomenon called exploding gradients. Exploding gradients is basically the fact that the difference between 1.01 to the fiftieth power and 0.99 to the fiftieth power is a lot: one compounds upward, the other decays toward zero, and 1 to the fiftieth power is exactly one. If you're a little bit off and you stack a bunch of things together, this blows up in your face; so they call it exploding gradients, and then you try to patch it up. If you come at it from the perspective of what's going on in the optimization, it's obviously a condition number somewhere; it's just not parameterized the way it is classically. So you can have this sort of problem, and there are some overfitting issues that I'll get back to.

The other thing is that there are lots of hyperparameters, and you have no idea at all what the hyperparameters should be, not even the order of magnitude. If you have five hyperparameters and you don't know each to within five orders of magnitude, that's an obscene overfitting problem; and so gobs and gobs of people in Silicon Valley are paid gobs of dollars to fiddle with parameters. Maybe that is the best solution, but the claim here is that if you understand a bit more about the structure of the problem, you can get rid of a whole bunch of those hyperparameters and do a better job out of the box. I'll mention that towards the end. All right, so remedies.
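A three-line numeric check of that compounding point; the depths are illustrative:

```python
# A factor slightly above or below 1, stacked many times, diverges or
# vanishes: the arithmetic behind exploding/vanishing gradients.
for depth in (50, 500):
    print(depth, 1.01 ** depth, 1.00 ** depth, 0.99 ** depth)
# depth 50:  1.64...    1.0   0.605...
# depth 500: 144.77...  1.0   0.00657...
```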
So: simple first order methods; accelerated; adaptive (I'll get back to what I mean by that); or second order methods. There are two ways to think about second order methods. One is: I have a point, I have a line, and I put some quadratic model there. That's a perfectly legitimate way, and it's the way I'd want you to think about second order. But sometimes subtleties arise in second order methods that aren't immediately obvious from that perspective, and they arise basically because the other way to think about second order methods is as a way to find a zero of a function; at bottom, that's all they do, they find the zero of a function. If you go to a numerical analysis book, Newton's method is: find me the zero of a function. Applied to optimization, the zero you're finding is of the derivative: you look at where the gradient is zero, and you apply the method there. But there are a few little gotchas you've got to be careful of with second order methods, and it's because they don't roll downhill, they find zeros; so you have to think about how to certify that you're at a minimum rather than at a maximum or a saddle point. Whether that's a bug or a feature depends on the application, but it's something to keep in mind.

So I'm going to talk about two things. One is a not-so-simple first order method that gets the best of both worlds in terms of combining knobs: FLAG and FLARE, for the standard composite optimization problem. The second order methods will be stochastic Newton-type: don't invert your Hessian, compute an approximate Hessian (I'm not going to do Hessian-free methods or L-BFGS, those sorts of things), and apply it Newton-style, not just in the convex case but also non-convex, with trust region and cubic regularization, some of those things.

There are a bunch of projects here and a bunch of people involved in the various pieces. In bold is Fred Roosta; he has actually moved to Queensland now. He was a postdoc with me, did a great job, and really led the charge on all of these things; he's involved in every one of them. Other people are in and out, but he's really been the lead.

OK, so subgradient methods. This is the problem we're solving, F plus H. There are a bunch of equations here; these slides are online, and I don't want to go through all the details. If you're familiar with this stuff the structure will look familiar, and if you're not, you can look afterwards; but let me point out the relevant things. For subgradient methods (a subgradient is like a gradient, except it's a "sub"; think of it as the same thing), you compute g, a subgradient, in line three; and in line four, x_{k+1}, the next iterate, is the argmin of g dotted into x plus a proximity term, with an alpha parameter. So that's the optimization problem solved directly at each step. You need to get a subgradient of F, and stochastic methods do that easily; and you need to choose a step size, which is a knob to fiddle with; we'll get back to that. The two limiting cases are: I want the step size to be constant, or I want it to go to zero in some way.
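Here is a minimal sketch of that kind of iteration for the composite problem min F(x) + H(x) with H the L1 norm, where the per-step subproblem has a closed-form soft-thresholding solution; the decaying step-size schedule is one illustrative choice among many:

```python
import numpy as np

def soft_threshold(v, t):
    """Closed-form prox of t*||.||_1: argmin_x (1/2)||x - v||^2 + t*||x||_1."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def proximal_subgradient(grad_F, x0, lam, alpha0=1.0, steps=500):
    """Each step k: g = (sub)gradient of F at x_k, then
    x_{k+1} = argmin_x  g.x + (1/(2*alpha_k))||x - x_k||^2 + lam*||x||_1,
    which reduces to soft-thresholding a gradient step."""
    x = x0
    for k in range(1, steps + 1):
        alpha = alpha0 / np.sqrt(k)   # decaying step size (one of the two limiting cases)
        g = grad_F(x)
        x = soft_threshold(x - alpha * g, alpha * lam)
    return x
```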
Logistic regression is a standard example that fits into this, and a lot of other examples fit the setup too. A practical question: you can have infrequent features. We're going to hear about term-document modeling: some terms appear in lots of documents, but lots of terms appear in very, very few, and those give you small partial derivatives. If you're familiar with natural language processing (not state of the art now, but even in the past), TF-IDF normalization rescales things so that very frequent terms don't mess everything up; this is a little bit analogous to that. The idea is that very infrequent features may be highly predictive: a word that is unique or fairly uncommon might be highly predictive, while very frequent features, like the word "and", are not so predictive. The way that shows up in the optimization is that frequent features have large partial derivatives, so for those directions maybe this learning rate parameter should be adjusted down; and infrequent features have small partial derivatives, so for those it should be adjusted up. So now we have a learning rate per feature.

And I'm telling you that you need different rates on different features, which is even more knobs. You replace alpha_k, which is a number, in fact a sequence of numbers, with a scaling matrix; AdaGrad, RMSProp, Adam, there is a whole bunch of methods that follow this general approach. It's the same algorithm: compute g, form a scaling matrix S, and solve the same sort of problem, except with S wedged in between. S could be anything, but here it's a diagonal matrix, and if it's roughly the identity you recover the previous algorithm. The type of guarantee you get with AdaGrad is: with F my objective, the suboptimality is bounded by root-d times D-infinity times alpha, over root-T; as opposed to the subgradient descent method, where the bound is D-two over root-T. Let's unpack this. There is a constant out front; alpha is some number; d is the dimension, lowercase d; and D-infinity is a max over infinity norms, as opposed to a max over two norms, of the size of the set you're dealing with, the constraint set. These two bounds are incomparable; neither is always better than the other. They both have the one-over-root-T, but the dependence on dimension, on the geometry of the set, and on other things is quite different. The comparative factors, D-two and D-infinity, capture the geometry of the set you're trying to optimize over, and alpha depends on something easily computable having to do with the gradients you've seen as you run along; their variability, let's say.
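A minimal sketch of that diagonal-scaling idea, in the style of AdaGrad; the accumulation rule is the standard one, and the names and constants are illustrative:

```python
import numpy as np

def adagrad(grad, x0, alpha=0.1, steps=1000, eps=1e-8):
    """Per-coordinate learning rates: coordinates that keep seeing large
    gradients (frequent features) get their step shrunk, while rarely-updated
    coordinates (infrequent features) keep a larger effective step."""
    x = x0
    G = np.zeros_like(x0)   # running sum of squared gradients, per coordinate
    for _ in range(steps):
        g = grad(x)
        G += g * g
        x = x - alpha * g / (np.sqrt(G) + eps)   # diagonal scaling S = diag(sqrt(G))
    return x
```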
So what we've talked about there is one way to start to decouple the geometry from the other parameters, and to say: I want a method that combines the best of both worlds. There we got a handle on the geometry and the dimension, but we said nothing about the number of steps, which is a different type of knob, and a whole separate line of literature says: I don't want to improve the dimension dependence, I want to improve the T dependence. Same setup: subgradient methods give me one over root-T; there is something called ISTA, which is a bit better and gives you one over T; and then FISTA, where the F is for "fast", is faster still and gives you one over T squared. The point is, if I want a certain accuracy, one over the square root of a thousand versus one over a thousand versus one over a thousand squared are very different, so the latter methods are going to be faster. You can easily convert this, if you're more comfortable that way, from rates in T to epsilons: one over epsilon squared versus one over epsilon versus one over root epsilon.

So here we're talking about improving the T dependence. The way to think about it: one camp says, gee, first order methods have these problems, so tack on a bunch of parameters to control learning rates; the other camp says, gee, they have these problems, so do ISTA and FISTA and add machinery to accelerate things, the way second order methods would look. Accelerated methods get better rates; adaptive methods get better constants, constants that depend on the geometry of what's going on (whether those count as constants depends on whether you think the dimension is a constant). Can you get the best of both worlds? In the interest of time I won't go into too much detail, but the short answer is going to be yes: FLAG and FLARE. Day or night, a flag and a flare will save your life. FLAG is a Fast Linearly-coupled Adaptive Gradient method, and FLARE is going to be a better version of that; I'll save what the last word stands for until later.

There is a range of ways to think about these accelerated methods. The original approach of Nesterov was very algebraic, and I think most people found it not so intuitive. People have recently tried to relate acceleration to continuous-time differential equations, and one of the approaches I like best is Allen-Zhu and Orecchia's; they call it linear coupling, where you combine information from a gradient step and a mirror-descent step, a primal view and a dual view, in a certain way. We're using exactly their linear coupling technique: we take the information from these two steps and couple them in a certain way, finding the coupling parameter with a bisection search (we'll get back to that). So we adopt their approach, if you happen to know it.

So, FLAG. You can see what happens: there is a basic thing that almost everyone can implement, and when that doesn't work you add a few lines, and when that doesn't work you tack on whatever else, and the code gets a little more complicated. y_k is in step three; y_{k+1} is some prox step; you get a gradient mapping; in the next step you form an S; in step seven, z_{k+1} solves a problem of the same form as the one we saw earlier, a little different, but if you had a routine for that one you can call the same routine here; and in step eight you do the linear coupling, where prox is the usual prox evaluation, which generalizes projection, the Moreau-type idea from convex analysis. So this is FLAG, and FLAG is good by certain standards; which standards depends on whether you want to evaluate yourself as an ML person, a theoretical computer scientist, or an optimization person, but by one of those metrics FLAG is good. The bird's-eye view of FLAG: you do the usual gradient step, you form a gradient history, you scale, and you combine. It's a very particular way to combine, but it combines with the goal of getting the best of both worlds.

So the theory: FLAG gives you the root-d, D-infinity dependence, times some fudge factor beta squared, divided by T squared, just as FISTA gives D-two divided by T squared. The convergence rates there both look "second order" in T, so they're both good.
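For reference, here is a minimal sketch of FISTA, the one-over-T-squared accelerated baseline that FLAG and FLARE get compared against, on the same L1-composite problem as before. The 1/L step size for an L-smooth F is the textbook choice, and the names are illustrative:

```python
import numpy as np

def fista(grad_F, L, x0, lam, steps=500):
    """FISTA: gradient step + soft-threshold prox + Nesterov-style momentum,
    giving O(1/T^2) suboptimality for L-smooth F plus lam*||x||_1."""
    x = x_prev = x0
    t = 1.0
    for _ in range(steps):
        t_next = (1.0 + np.sqrt(1.0 + 4.0 * t * t)) / 2.0
        y = x + ((t - 1.0) / t_next) * (x - x_prev)   # extrapolation (momentum) point
        z = y - grad_F(y) / L                         # gradient step at y
        x_prev = x
        x = np.sign(z) * np.maximum(np.abs(z) - lam / L, 0.0)  # prox (soft-threshold)
        t = t_next
    return x
```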
Now, I don't yet know whether this is better or worse, because we've introduced some knobs and removed some knobs, so let's look at that. There is a comparative factor, the analog of what we had before: this one depends on the geometry, and the beta depends, in a slightly different way, on the gradient history, through the linear coupling.

If you're interested just in iteration complexity (one way machine learners parameterize the complexity of an algorithm is the iteration count: not how expensive each iteration is, just assume each iteration is cheap and count iterations), then here you need to couple the two pieces with an epsilon binary search. For Allen-Zhu and Orecchia (I don't remember exactly where the paper appeared), this was something they didn't need to worry about; we do, and the question is: find an epsilon-approximation to the root of a certain equation; how long does that take? At most log(1/epsilon) bisection steps, and at most 2 + log(1/epsilon) prox evaluations per iteration. So there is a lot going on here: what's actually expensive in the whole setup? Because if I write a paper and say I'm faster than you, and I implemented your code (maybe not doing a particularly good job of it) and I implemented my code (being very careful to do a good job), then, look, I'm better than you. That doesn't seem like the best way to do things. A way that is common, not in statistics, not in algorithms, but in scientific computing, is to evaluate the complexity of an algorithm by its most expensive step, the number of matrix-vector multiplies. Let's do the same thing here with the number of prox evaluations: nothing else matters, that's the dominant cost, everything else is second order; no pun intended, I guess.

So FLAG, with its linear coupling, takes a certain number of steps, and in particular it takes two prox evaluations per iteration more than FISTA. Everything we talked about before was interesting at a reasonably high level of theoretical granularity; now we're asking a finer-scale question. At most two prox evaluations more than FISTA: is that a lot or a little? I don't know; it depends how you implement it, you tell me. It turns out to be a lot. So now the question is: rather than reducing the ML problem to an optimization problem and then reducing that to a bunch of knobs to combine, let's look under the hood of those prox evaluations at what we're actually doing. We're linearly coupling two different things, and under the hood we can do one of two things. If we ask the worst-case question, we just say: I want to guarantee that I'm good on the next step; that costs what we just talked about. Or I could do something, not exactly iterative refinement, but something iteratively corrective. In the worst case you don't control how many inner iterations that takes (the answer is it takes two, and you can construct something worse), but you are in fact correcting as you go. So, FLARE:
if you do that iterative correction, you have exactly the same number of prox evaluations as FISTA. So the dominant cost has been brought down to match, but FLARE carries theoretical guarantees similar to FLAG's: the best of both worlds, at FISTA cost, compared to FISTA. I'm going to gloss over the empirical results, not because I don't want to talk about that stuff, but because I want to go into a bit more detail on the material at the end; that's on purpose. Suffice it to say that the claims I was making can be borne out, and we did a fairly nice job evaluating that.

This is interesting because it's a point in parameter space where we can make first order methods look a little like second order methods, and we have a good understanding of what's going on with the geometry, the constraint sets, versus these various knobs. So now: can we do it with a lot fewer knobs? The reason I want a lot fewer knobs is that, in some sense, if you have a million algorithms and a million data points, in the cross product you're bound to find a positive result. This is the thing statisticians try to control with false discovery rates. The analogous question here: if you have gobs of knobs, you're bound to find some parameter setting where it works, but it's not clear that it works in some cross-validation sense; and it's not even clear how to put it into a cross-validation setting, because this isn't a model, it's an algorithm. The algorithm is implicitly implementing a model of whatever it implements, but it doesn't fit cleanly into the usual cross-validation framework you might think of in statistics.

So let's shift gears. Now we're going to look at minimizing the finite-sum problem, and we're going to do a couple of things. We'll look at stochastic Newton-type methods, second order type methods. They apply more broadly, but think of the problem as convex, and it's going to be the usual story: if things are convex in such-and-such a way, then I can introduce stochasticity into the second order method and good things still happen. Then I'll talk about trust region and cubic regularization methods.

The formal story for second order methods is: if you're convex and you choose all your pieces properly, you iterate and you find the solution. What if you're not convex? Lots of examples in scientific computing are that way; neural networks are not obviously convex; and a range of other things, non-negative least squares, matrix completion, topic modeling, lots of things are non-convex. Non-convex is a big place, and lots of stuff can happen there; so from this perspective, what we want to ask is: are we at a minimum or not, or are we stuck at a saddle point? If you look at a plot, in your calculus book, how can you not know? You just look. But in a million dimensions it can be hard to answer that question. With trust region and cubic regularization methods, you either add a cubic term, which can find a descent direction near a saddle, or you have a region within which you trust a quadratic model and solve a related problem, which amounts to an eigenvalue-type problem. So this is the hello world of applying these stochastic second order methods, and we can do the same thing we did with FLAG and FLARE: take these methods into non-convex pipelines.
So, second order methods use gradients and Hessians, and I think Hessians are just mentally much more complex objects for people. Part of what's going on at Berkeley (there is the data science division and so on) is that I'm involved in teaching a freshman-level, maybe sophomore-level, class on mathematics of data, and it involves some linear algebra, probability, and optimization. The way this material is usually taught, you do calculus, then a little bit of multivariable, and then curls and divergences, because that's what engineers care about, and only later might you learn linear algebra. For this stuff, I want to start with the linear algebra, because the intuitions you have about R-two do not all generalize to R-million; in fact R-million behaves in very funny ways, totally different from the low-dimensional R-two and R-three, and you can understand that in terms of measure concentration and some coin-flipping ideas. So I think that's the way to teach it; but, to be clear, the way it's usually taught, Hessians are conceptually much more complicated objects: you need some idea of what a matrix is, or the way they're usually presented involves a lot of subscript chasing. The pluses of second order methods: fast convergence rates; the structure of the algorithm, the thing you're doing itself, is resilient to ill-conditioning; and they can overfit, in the sense that if I want machine precision I'll get machine precision, but they overfit in a very structured way. It's not a million knobs all over the place; they overfit in a very structured way, which, in the space of models your algorithm implicitly implements, might be a good thing. But certainly done naively the per-iteration cost is high; so can we get around that?

Let me skip ahead, because I do want to get to the slides at the end. Deterministically approximating second order information cheaply is a large body of work; there are a bunch of books on this, and a lot of great stuff: quasi-Newton methods, BFGS, limited-memory BFGS, and so on. With randomized linear algebra, I had the good fortune to bump into that area around the time of my dissertation and moved into it. People were working on random sampling algorithms, and it seemed that the particular algorithms themselves weren't going to be competitive when the rubber hit the road, but the ideas were very different, and they could become competitive. Fast forward some number of years, and the answer is yes: Blendenpik can use randomized linear algebra to beat LAPACK in wall-clock time; then the LAPACK people respond that they can do better, and there is an arms race; but you're in the arms race, you're not off by polynomial factors. So can you do the same sort of thing here? What we want to do is randomly approximate the second order information.

Do we care about machine precision? Do we care about conditioning? How do we care about them? I'm going to punt on that question, because I'm going to care about these things insofar as they matter for the foundations of data; we're doing this TRIPODS institute. Should I worry about the same things people worried about when they were developing the IEEE floating point standard? Is that stuff useless now, or are some nuggets of it still useful? That's
a long, hard question with a long answer, but we're going to worry about these things insofar as they are useful for machine learning and data analysis applications these days, broadly defined.

If you do that, then you should ask yourself: we're going to randomly approximate something, so where does the randomness live? If you ask people where the randomness is, a statistician will say the randomness is in the data: they have a model, and the model is y equals X beta plus noise, or whatever, so it's not random in the algorithm; the randomness is in the model, and the algorithm is called as a black box. If you ask a computer scientist where the randomness is: the data is all there is, there is nothing else, so if there is going to be randomness it has to be in the algorithm run on the data. What we saw in randomized linear algebra, although it was harder to see there because of how the structure plays out, and what you'll see here also, is that the randomness in the algorithm can interact nicely with the randomness in the data, and can implement regularization implicitly. I'm not going to talk about that today, because it's a finer-scale thing; but we're going to be randomly approximating second order information, and we may not even want to solve the problem we say we want to solve, because we want to solve the implicitly regularized problem, and then there is the question of how that maps onto these conditions and guarantees.

So: subsampled Hessians. There have been a bunch of results in recent years on this. Martens did it a while back, actually, in the neural nets application; he was well ahead of the curve, and it's one of those things that was just good stuff early on, with actually nice results; if you're not familiar with it, look up Hessian-free optimization. Sketching the Hessian, subsampling Hessians and gradients: there is a range of things here, and I'll be talking about some of them. The idea is that we want to find that minimum point, and we're going to run an iterative algorithm. If you get the exact gradient and the exact Hessian and your function is quadratic, it takes one step; if you're not quadratic, you iterate at whatever rate. What if I get a noisy estimate instead? What if it's extremely good, very little noise? What if it's extremely bad, but marginally better than random? That's the tradeoff: how big is your mini-batch, how do you control variance? And I want to show that we get guarantees similar to the deterministic case, but with randomness.

So we can subsample Hessians and gradients. Consider the convex case, the usual setup. We want to design methods for machine learning that are not as ideal as Newton's method but have these properties: they turn in the right directions (not here, but there); they take steps of the right length, where a step size of one works most of the time; and they scale up in some meaningful way. So we're going to sample gradients and Hessians, where the sample size has to be independent of n,
the number of data points, and of d, the number of features. Turning in the right directions means: don't just go downhill, move towards the minimum, in the Hessian sense; that's the natural structure if you have second order information. So we want the Hessian approximation to preserve the spectrum of the true Hessian as much as possible, and that boils down to subspace embedding ideas from randomized linear algebra. With that we're not ideal, but close: we get a fast local convergence rate, close to that of the much more expensive Newton method. And the right step length: you don't have decaying step lengths and all these other hyperparameters; step size one works, or maybe it works after an initial burn-in. And we want to scale up.

For turning in the right directions, the theory is going to be: if the sampling size |S| is larger than some quantity, then with probability 1 minus delta you have a Hessian approximation to within a factor of 1 plus-or-minus epsilon. If you're familiar with this kind of statement, the equation is the obvious thing; if you're not, H is the approximation, del-squared F is the true Hessian, and they are within an epsilon of each other in the spectral sense, with delta the failure probability. So we get a good approximation of the Hessian.

Fast local convergence rate: if you set the subsample size right, then with high probability the distance from one iterate to the minimum has a linear term and a quadratic term, and this is a very nice property; you can trade one against the other, and moreover the coefficient on the linear term is problem-independent, so you can make it arbitrarily small. You get what's called Q-linear convergence, and you can also get superlinear rates in certain settings. So it converges the way it should.

Right step length: unit step size eventually works, with uniform subsampling; I'm going to gloss over this, but the right step lengths eventually work, and we can do it with an inexact update as well. Local and global: if you're familiar with the usual convergence properties, then if we choose the sampling of the gradients and/or Hessians the right way, it converges. If you're an expert in optimization, there is local versus global convergence, and there are subtleties I'm glossing over, but that's the leading order bit. And it all plays out very nicely: an optimization algorithm for which the unit step length works has some wisdom in it. It's not handing you gobs of knobs; it's getting there in a very clean way. In a sense it's the right way to overfit, and if we're going to do early stopping, we can characterize the implicit regularization of early stopping in sort of the right quadratic well.
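A minimal sketch of one variant of this: a subsampled Newton iteration for a finite-sum problem with exact gradients, subsampled Hessians, and unit steps. The dense solve, the tiny ridge term, and the sample size are illustrative choices (an inner conjugate-gradient solve would avoid forming H explicitly):

```python
import numpy as np

def subsampled_newton(grad, hess_i, n, x0, sample_size=50, steps=50,
                      rng=np.random.default_rng(0)):
    """Each iteration: H = average of |S| randomly chosen per-example Hessians,
    solve H p = grad(x) for the Newton direction, and take a unit step."""
    x = x0
    d = x0.shape[0]
    for _ in range(steps):
        S = rng.integers(n, size=sample_size)            # uniform subsample of examples
        H = sum(hess_i(x, i) for i in S) / sample_size   # subsampled Hessian estimate
        H = H + 1e-8 * np.eye(d)                         # tiny ridge for numerical safety
        p = np.linalg.solve(H, grad(x))                  # solve, never invert
        x = x - p                                        # unit step length
    return x
```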
So that was all convex; now let's deal with non-convexity in the last five minutes or so. Saddle points, local minima, local maxima: things get very wild. Non-convex problems: what does that even mean? It's a big space; saying something is not convex just says "here is convex" and "everything else is not", which is not a strong statement, so what you've got to do is carve out bits and pieces that are quasi-convex or invex or have some other structural property. That's what we'll be looking at. And we want some notion of, let's call it, (epsilon-g, epsilon-H)-optimality: the gradient is small, its norm less than epsilon-g, and lambda-min of the Hessian is positive, or at worst only slightly negative, greater than minus epsilon-H.

To deal with this there are trust region methods. Trust region methods say, basically: don't choose a direction and then do a line search; instead, choose how far you're going to go, and then choose where exactly to go, and it's not along a line, it's within a region. So they're actually very different from line search methods, if you know line search methods. And cubic regularization methods basically look at the next order term in the Taylor expansion, and if you're near a saddle point, that term helps you find a descent direction.

To get iteration complexity (and there is a lot of technical work behind iteration complexity), previous work required condition number one, where the approximate Hessian agrees with the true one along the step direction up to a term quadratic in the step length; this is stronger than the Dennis-More condition, if you're familiar with that from optimization. You can relax condition one to condition two, where the Hessians are just good, without that square. Who cares, right? This seems like a weeds-level thing. But condition one is a much more convenient thing for analysis (the analysis is much easier, and a lot of people work with something like it), while condition two is actually a lot harder: we have fifty pages of math to justify it. The payoff is that quasi-Newton methods satisfy two and not one, and in particular sketching and subsampling methods from randomized linear algebra satisfy two and not one. So we were able to operationalize this in a much stronger way, because we satisfy condition two and not one. With that setup, let me gloss over the details and just say that both for trust region and for cubic regularization we get results that are either the same as, or slightly worse than, what you get with the classical deterministic methods, in terms of finding the right directions, getting out of saddle points, and finding local minima.

All right, so the math sounds nice. Here are some preliminary results; they're at the end of the slides because we actually have a bunch of other results that motivated some of these questions. So now we can say: given classical trust region, classical cubic regularization, classical convex theory, the leading order bit is that we can reproduce a bunch of it with stochastic first and second order methods. For first order, others have done gobs of work on that; we can take the best of both worlds; and now we can do it for second order methods. And the running times are very practical; this is not just a theoretical claim; these are practical in the sense of factors of two in the number of prox evaluations or matrix-vector products.

Who cares? Why would one want to do second order optimization in machine learning applications? In some cases you have ill-conditioning, and when you have ill-conditioning, if you do a first order method, you're the purple curve: you can run two hundred fifty iterations, and I could make it two hundred fifty thousand and you'd still be up there. The better first order variants, by the time you get to fifty, are the black or red curves, depending on details like the sample sizes; and you might be green, where green is Newton; green gets there, but it's more expensive. So beating the black curve is better. If you're solving a problem that is ill-conditioned (this one is not a neural network with exploding gradients, but that's an example of one), you do better; and a scientific computing person will look at that and say, yeah, obviously. But OK.
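Stepping back for a moment, here is a minimal sketch of the cubic-regularization idea described above: minimize the model m(p) = g.p + (1/2) p.Hp + (sigma/3)||p||^3, here by plain gradient descent on m, which is a simple but not especially efficient way to attack the subproblem. All names and constants are illustrative:

```python
import numpy as np

def cubic_reg_step(g, H, sigma=1.0, inner_steps=400, lr=0.02,
                   rng=np.random.default_rng(0)):
    """Approximately minimize m(p) = g.p + 0.5*p.H.p + (sigma/3)*||p||^3.
    Unlike the pure quadratic (Newton) model, m is bounded below even when H
    has negative eigenvalues, so near a saddle point its minimizer moves along
    the negative-curvature direction, giving a descent direction."""
    p = 1e-3 * rng.normal(size=g.shape)   # small random init, so negative curvature
                                          # is picked up even when g is exactly zero
    for _ in range(inner_steps):
        grad_m = g + H @ p + sigma * np.linalg.norm(p) * p
        p = p - lr * grad_m
    return p

# Toy saddle: F(x) = x0^2 - x1^2 at the origin has g = 0 and H = diag(2, -2).
# A pure Newton step stalls there; the cubic model steps out along x1.
print(cubic_reg_step(np.zeros(2), np.diag([2.0, -2.0])))
```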
"My machine learning problem isn't ill-conditioned; why should I care?" Fair question. A few answers. First, good generalization and robust parameter tuning: the way this works, with people fiddling with knobs, is that very practical neural network algorithms exhibit convergence properties that are just qualitatively inconsistent with what convex theory would suggest. What we see is good generalization and a robust kind of parameter tuning: the thing on the top, and a bunch of the other curves, are first order methods that perform very poorly or fall over, while the second order methods, almost out of the box, perform well; you'll be down in the black and the red and the blue. For the other ones, you had to beat on them to find the right hyperparameter values, even in this very simple setup. Second, the ability to understand and escape saddle points: a phenomenon you see a lot is that you get a little bit better, then you flatten out and run forever, and then, boom, you get better. Why? I change one little parameter and sometimes it gets better, and sometimes it doesn't. A lot of that has to do with understanding the structure of basins and saddle points; and here, in this particular plot, the yellows and oranges are momentum-based methods: you go down, you flatten out, you run forever. In this particular setup our initial burn-in was a little slower, but then you hit the basin, you're able to work out the corner cases better, and you avoid that plateau. Third, if you're interested in distributed applications, you can implement these things with low communication; the communication is the bottleneck, not the computation, and you can run them in distributed settings. We've done randomized linear algebra on terabytes of data; not this yet, but this will carry through to that, so you can implement it in distributed settings. And if you want to put it on GPUs and argue about factors of two in running time, you can put it on GPUs; that part is unpublished, but we have it.

So there are a bunch of preliminary results at the end. I think I was here maybe five years ago; whether it was then or before that, I can tell you how those preliminary results panned out. Right now there are a lot of preliminary results again, but I think we're setting up the framework to do a bunch of first and second order optimization methods, analogous to what we did with randomized linear algebra, at the heart of a foundations-of-data discipline, aimed at very practical downstream problems. So with that, let me wrap up; and I'm the leading act, so stay around for the next hour. Thanks.