Thank you very much for the very kind introduction. It's always a pleasure to be here and talk about things. What I'm going to talk about today is two algorithms that we developed recently in our group, both under the umbrella of optimization for machine learning. Feel free to interrupt me at any point, ask questions, and if you have any comments please let me know; stop me at any time if anything is not clear. OK, so let's start with a really generic problem that we consider a lot in machine learning, which is binary classification. You're given a bunch of data points, and you have labels for these data points. For instance, these could be pictures of people, and the labels could be "this is a picture of a man" and "this is a picture of a woman"; or these could be emails, and the labels could be "this is spam" and "this is not spam", and so on. There are any number of applications where you get data like this, and what you're interested in is finding a decision boundary which can separate the points that have a plus label from the points that have a minus label. In this particular case it is reasonably straightforward to see that there is a linear hyperplane which can separate these two sets of points. If you want to ask how good a given linear decision boundary is, you can look at what is called the margin: the margin is nothing but the distance of the closest point, from either of the two classes, to your decision boundary. The idea is that if your points are very far away from your decision boundary, it's a good boundary, and if they're very close to it, it's a bad boundary. A little bit of geometry will immediately tell you that you can compute the margin as two over the norm of the weight vector w, the normal to the hyperplane that you produced. So you can use this intuition to set up an optimization problem which looks like this. There's a little bit of added complexity here in terms of slack variables, which account for data points that may not be correctly classified (you may have some points which are misclassified), and basically you are trying to maximize the margin subject to the constraint that all the points lie on the right side of the hyperplane. This is the optimization problem that underlies linear support vector machines. We'll come back to this problem in just a minute, but let's talk about what happens if your data is not linearly separable. Simple example: I give you a set of points which looks like this. Clearly you cannot separate them using a linear hyperplane, but there is a simple nonlinear boundary that separates this set of points. What you can do in this case is take your data points and transform them to another space where they potentially become linearly separable. Say, in this case, you can transform them to a space where one coordinate is x and another coordinate is x squared plus y squared, and you can clearly see that they become linearly separable there. So coming back to the optimization problem that we had before: you take your data points, apply some mapping function phi to them, and now, in this new space, you want to find a linear separating hyperplane.
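For reference, a standard way to write down the soft-margin problem being described here (my notation; the slides may scale things slightly differently) is:

\[
\min_{w,\,b,\,\xi \ge 0}\ \ \frac{1}{2}\,\|w\|^2 \;+\; C \sum_{i=1}^{m} \xi_i
\qquad \text{s.t.} \qquad y_i\big(\langle w, \phi(x_i)\rangle + b\big) \;\ge\; 1 - \xi_i, \quad i = 1, \dots, m,
\]

where the margin is \(2/\|w\|\), the slack variables \(\xi_i\) absorb misclassified points, and \(\phi\) is the (possibly identity) feature map.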
It turns out that if your feature space is very high dimensional, for instance a hundred thousand dimensional or a million dimensional, you can come up with a dual optimization problem which looks like this. The nice thing about this dual optimization problem is that it only depends on dot products, dot products between these mapped data points. And there is a well-known trick, called the kernel trick, which basically says that all these dot products can be replaced by a function called the kernel. It's sort of a black box, where evaluating the kernel function is the same as taking dot products in this mapped feature space. This kind of optimization problem is what underlies nonlinear support vector machines, if you have heard about them. And there are very efficient algorithms for it; in particular there's one very efficient algorithm called the SMO algorithm which can optimize an objective function that looks like this. Now, when I start teaching this to my class, or when I explain it to somebody new who hasn't seen this kind of stuff before, the very first question they ask is: this is all cool, you can transform the data points into some high dimensional space and potentially come up with a linear classifier there, but how do we know what kernel to use? For about seven years or so people have looked at this problem, and they have come up with a framework called the multiple kernel learning framework, which simply says: you take your data points and you cook up what I will call base kernels. Depending on different aspects of the domain knowledge that you have about your problem, you cook up these dot products, as many of them as you want, and then you combine the kernels you cooked up using a linear combination with non-negative weights, to customize the kernel, the similarity measure, for your particular problem. Now, this kind of problem appears in many different places, in many different forms, and let me motivate it with one simple example, one that I run into frequently. This is something that the person I collaborate with at Microsoft Research is very interested in. Here's the problem: object detection. You're given an image, and in this image you want to put bounding boxes and say, inside this yellow box there is a person, inside this red box there is a bike, and maybe inside this other box there is some sort of a sign, and so on. Now, you're given a large number of these images and you want to find these bounding boxes efficiently. As you can imagine, in a picture a person can occur at any level of granularity: a person may be somewhere in the background, in which case it is a very small patch of the image, or a person may be in the foreground, in which case a much larger area is covered by the person, and so on.
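Again for reference, the kernelized dual and the multiple-kernel combination being described look roughly like this (standard notation; the exact constants on the slides may differ):

\[
\max_{\alpha}\ \ \sum_{i} \alpha_i \;-\; \frac{1}{2} \sum_{i,j} \alpha_i \alpha_j\, y_i y_j\, k(x_i, x_j)
\qquad \text{s.t.} \qquad 0 \le \alpha_i \le C, \quad \sum_i \alpha_i y_i = 0,
\]

where \(k(x, x') = \langle \phi(x), \phi(x')\rangle\) is the kernel, and in multiple kernel learning one uses a weighted combination of base kernels,

\[
k(x, x') \;=\; \sum_{k=1}^{n} d_k\, k_k(x, x'), \qquad d_k \ge 0.
\]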
So in this case, one natural way to approach the problem is to say: let's construct features at many different levels, a whole hierarchy of levels, to capture different granularities. And because you're given image data, you can extract features along many different dimensions: you can extract things like SIFT features, you can do some edge detection, you can do some sort of shape detection, and so on, and all of these are different features you can use, again at different levels of granularity. So now you have this huge number of feature vectors which can be used to construct similarity measures; you get a lot of kernels, and you want to combine them. And it turns out that this approach of combining kernels achieves some of the best results on the PASCAL visual object classes challenge. Here you can see different images and some of the example bounding boxes that you find. So in an application like this, and similar applications exist in bioinformatics and so on, the problem is that you get a large number of data points, at least a few thousand, and you may also be able to cook up a large number of kernels. So what you want is some sort of algorithm which can scale both along the number of data points and along the number of kernels. This has been the holy grail: for seven years, more or less, people have tried and struggled to use the very efficient SMO algorithm that they know for training support vector machines to solve this problem, and they haven't been able to do it. Recently we managed to do it, and I just want to give you a snapshot of the results before I dive into the details, so that you get a picture of what is coming. These are some experiments on the relatively small Sonar dataset, with about eight hundred kernels constructed from its features, against Shogun, which was the state-of-the-art algorithm before we published our work. You can see we are about six to ten times faster than Shogun, and not just that, we select fewer kernels: out of the set of kernels we select the more meaningful ones, a smaller subset, which is also very important because you don't want all possible kernels; some of them may not be good for your task and you just want to suppress them. So we select fewer kernels and we're about six to ten times faster. On another dataset, which is about fifty thousand points, with about fifty kernels, we can train in thirty minutes. And on this dataset, if you give us one hundred thousand kernels, precomputed and fed to the algorithm, we can train in about eight minutes, and if you want us to compute the kernels on the fly, we can train in about thirty minutes. All of this is on stock standard hardware; it basically runs on my laptop. Here we took a series of RBF kernels with different bandwidths and combined all of them together. So how do we do this? Let's set up the optimization problem. You can think of it as: you are given a bunch of kernels, they have a corresponding set of feature maps, and what you're interested in is finding a combined feature map
which gives different weights to each one of these individual feature maps. You can immediately see that once your phi has this kind of block structure, your weight vector also has the same kind of block structure, and you can plug this into the familiar optimization problem from before. You get an optimization problem which looks like this; the changes are highlighted for you in red. Now if you stare at this problem for just a minute, you'll realize it has one very serious flaw as written: given any non-negative weights, I can let the d_k go off to infinity and trivially drive the objective down. So clearly you need some way of constraining the d's, some way of saying that they should not shoot off to infinity. The way you do that is to add some form of regularizer on them; here I'm using a p-norm regularizer, and more generally you can use a Bregman divergence (come talk to me afterwards if you're interested in that). In some formulations, especially the one that Shogun uses, instead of adding the regularizer to the objective function they add it as a constraint, but these are basically equivalent: you can show that there is a value for which, if you add it as a constraint, you get the same optimization problem. So they're more or less equivalent; what everybody does is somehow constrain the d's, and the way we constrain them is with the p-norm. Now, this optimization problem still has one more problem, which is that it is not jointly convex in d and w. So you do a variable substitution, and this is the overall optimization problem that you get. Now, what most people would start doing at this point is dualize the problem; they do what is called a partial dual, which means they eliminate the w, the b, and the xi out of this problem, and they get some sort of optimization problem which looks like this. Rather than the equations, let me show you a picture which explains this best. What you get is a surface which looks like a saddle: you want to maximize with respect to your alphas and minimize with respect to the d values, and what you're looking for is a saddle point, like the point in the center. Saddle point problems are of course hard, and usually the way people solve them is that at every iteration they keep the d fixed and try to find the optimal alpha. If you stare at the problem when d is fixed, it becomes a stock standard support vector machine problem, so you can just use your favorite SVM solver. Then at the next iteration they keep the alpha fixed and find the optimal d again; the problem in d is usually a simple problem which you can solve more or less analytically. And then they keep repeating this. The problem with this, of course, is that you have to make multiple calls to a support vector machine solver, usually something like a hundred or a hundred and fifty calls, before you get convergence. And if you look at all the papers in the field for the last seven years, they have all been, in one way or another, about tweaking this wrapper approach.
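To make the wrapper approach just described concrete, here is a minimal sketch of one common variant of it. This is not the SMO-MKL algorithm from the talk, and the closed-form d-update shown (a standard l_p-norm MKL update in the style of Kloft et al.) is an illustrative assumption rather than what the slides use:

```python
# Alternating ("wrapper") multiple kernel learning: fix the kernel weights d and
# train an SVM on the combined kernel, then update d with alpha held fixed.
# kernels is a list of precomputed n x n Gram matrices; y holds +/-1 labels.
import numpy as np
from sklearn.svm import SVC

def alternating_mkl(kernels, y, C=1.0, p=2.0, n_outer=20):
    n_k = len(kernels)
    d = np.ones(n_k) / n_k                              # non-negative kernel weights
    for _ in range(n_outer):
        K = sum(dk * Kk for dk, Kk in zip(d, kernels))  # combined kernel
        svm = SVC(C=C, kernel="precomputed").fit(K, y)  # inner SVM call
        alpha = np.zeros(len(y))
        alpha[svm.support_] = np.abs(svm.dual_coef_).ravel()
        ay = alpha * y
        s = np.array([ay @ Kk @ ay for Kk in kernels])  # alpha^T Y K_k Y alpha
        wk_sq = d**2 * s + 1e-12                        # ||w_k||^2 per block
        # closed-form l_p-norm weight update (illustrative choice)
        d = wk_sq ** (1.0 / (p + 1))
        d /= np.sum(wk_sq ** (p / (p + 1))) ** (1.0 / p)
    return d, svm
```

The point of the sketch is only the structure: every outer iteration makes a full call to an SVM solver, which is exactly the cost the talk is complaining about.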
Let us try to make things a little bit better; let us use the saddle point structure in a different way; maybe let us not solve the inner problem all the way to optimality: this is what people have been doing. When we started looking at this problem, especially inspired by these object-detection-type results, we said this is not going to work; we cannot go and train support vector machines on these large collections over and over again. The key insight that we figured out is to ask: why do this partial dualization at all? Why not dualize the problem entirely? And if you dualize the problem entirely, the d variables are eliminated from the picture completely. You get an optimization problem which is, admittedly, a mouthful, but there is a dual connection by which, given an alpha, you can recover the d. So you haven't lost anything; you can still get your d values out, given an alpha. Now all you need to do is find that optimal alpha. If you look at this problem, the constraints are exactly the same as a support vector machine, and this part is exactly the same as a support vector machine; the only difference is this term. So it's not quite a quadratic programming problem, but it is very, very close; it has exactly the same structure as a support vector machine problem. Once you realize that, you can come up with a very simple algorithm, which is basically an SMO algorithm: at every iteration you choose two variables, alpha_i and alpha_j, to optimize (you need to choose two because you have a summation constraint), you get a one-dimensional optimization problem, you solve that one-dimensional problem efficiently, and you keep repeating until convergence. To complete the description of this algorithm, all I need to do is tell you two pieces of information. First, how do I select the alpha_i and alpha_j at every iteration? We do a greedy selection, and the greedy selection is based on second-order information, the Hessian: basically, you try to predict which pair of variables will give you the maximal reduction in the objective function value at the given iteration, and you select that pair. Second, how do I solve this one-dimensional problem? It turns out that when p and q are equal to two, this is a one-dimensional quartic; you can look up in Wikipedia the formula for computing the roots of a one-dimensional quartic, so you get an analytic solution. For other values of p you use Newton's method, which is very efficient here because the Hessian information is already available, so it's a very cheap step as well. Now, first let me assure you that having the p-norm in the objective function rather than in the constraints doesn't change anything in terms of generalization performance. Here are some small datasets; this is the Australian dataset, and across a range of p values I'm plotting test accuracy. Shogun is the solver which puts the norm in the constraint, and you can see more or less exactly the same kind of generalization performance, so you haven't lost anything here. Same thing on some other datasets.
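To make the pairwise-update structure just described concrete, here is a generic SMO skeleton, but for the plain SVM dual rather than for SMO-MKL: in SMO-MKL the outer loop is the same, while the one-dimensional subproblem becomes a quartic (for p = q = 2) or a Newton step, and the working-set selection uses second-order information rather than the first-order rule sketched here.

```python
# Generic SMO for the standard SVM dual: pick a maximally violating pair,
# solve the 1-D subproblem analytically, clip to the box, repeat.
import numpy as np

def smo_svm(K, y, C=1.0, tol=1e-3, max_iter=100000):
    n = len(y)
    alpha = np.zeros(n)
    Q = (y[:, None] * y[None, :]) * K        # Q_ij = y_i y_j K_ij
    grad = -np.ones(n)                       # gradient of 1/2 a^T Q a - 1^T a
    for _ in range(max_iter):
        up = ((alpha < C) & (y > 0)) | ((alpha > 0) & (y < 0))
        lo = ((alpha < C) & (y < 0)) | ((alpha > 0) & (y > 0))
        if not up.any() or not lo.any():
            break
        i = np.where(up)[0][np.argmax(-y[up] * grad[up])]
        j = np.where(lo)[0][np.argmin(-y[lo] * grad[lo])]
        if -y[i] * grad[i] + y[j] * grad[j] < tol:
            break                            # KKT conditions (almost) satisfied
        # analytic minimum of the 1-D quadratic along the feasible direction
        quad = Q[i, i] + Q[j, j] - 2 * y[i] * y[j] * Q[i, j]
        delta = (-y[i] * grad[i] + y[j] * grad[j]) / max(quad, 1e-12)
        di = np.clip(alpha[i] + y[i] * delta, 0, C) - alpha[i]
        dj = np.clip(alpha[j] - y[i] * y[j] * di, 0, C) - alpha[j]
        di = -y[i] * y[j] * dj               # re-enforce the equality constraint
        alpha[i] += di
        alpha[j] += dj
        grad += Q[:, i] * di + Q[:, j] * dj
    return alpha
```

Each update touches only two dual variables and two columns of the kernel matrix, which is why this family of methods scales so well in practice.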
But when you look at the actual scaling behavior, you really see the difference. This is the Adult dataset: it has one hundred and twenty-three dimensions and about thirty-three thousand points, and it comes with a natural split; there are nine different splits of this dataset, starting from sixteen hundred points and going up to thirty-three thousand points. You can see that as the number of training examples increases, of course the time that you need increases; but if you compare against the time that Shogun needs, we're about five to ten times faster, and this is the same story on almost all datasets. Even better, beyond a certain split Shogun simply gives up and cannot go any further. Pretty much, the state of the art was that you could do about ten thousand points with about five or ten kernels; that was it. But it turns out we can just keep happily chugging along, even for larger numbers of points. What I'm showing here is the time that we take, with a curve fitted through it, and you can see that the price you pay as you add more and more data points is something like quadratic in the number of points. But that is exactly the price you would pay if you were running a plain support vector machine: if you take a simple SVM and add more and more points, you pay pretty much that price, and that's exactly the kind of price you're paying here. Similar behavior again on another dataset; on this one, for some strange reason, Shogun wouldn't even run, not even on the smallest split, and we're talking to the developers of Shogun to find out why that is. And here is a fun experiment that we did, where we wanted to see how the algorithm scales with the number of kernels. We started with about ten thousand kernels (there's no other algorithm which can handle that many kernels, so there's really no comparison) and then we kept adding more and more kernels, in increments of ten thousand, up to about one hundred thousand, at which point we got bored and gave up. Pretty much, you can see that in less than about half an hour we can train with one hundred thousand kernels, and if you fit this curve and look at the scaling as a function of the number of kernels, you're paying a linear price. As you increase the number of kernels, you only pay a price which is linear in the number of kernels. Which means that, basically, if you want to combine say five kernels on any dataset, you end up paying about five times the cost of running an SVM once on that dataset. And that is to be expected, because the dominant cost of solving this, as with an SVM, is in computing the kernels, and you have to compute those five kernels at least once; so you pay essentially the same price here. And this is another large dataset, I guess potentially one of the highest dimensional datasets on which I have seen multiple kernel learning results; there may be others, but at least it is a reasonably high dimensional dataset. This is the Real-sim dataset, with about seventy-three thousand points and about twenty-one thousand dimensions, and again you see the same scaling behavior. And of course, I forgot to add: I'm only showing you particular values of p and C,
but the story is more or less the same for all of them; the scaling is fairly independent of the p value. For many different values of p (we tried everything between 1.01 and 2, 3, and 4) we get pretty much the same kind of behavior, the same linear scaling. So, before I move on, questions? [Question from the audience.] Yes, this is the Lagrangian dual. The saddle point problem is a pain because you're only doing a partial dualization; when you do the full dual you still get a convex problem. So when you fully dualize the problem there is no approximation of any sort; everything is exact. [Another question, about sparsity.] So, in practice, when you choose p to be something like 1.1 or 1.01, for all practical purposes it is the same as solving with an l1 penalty, which would enforce sparsity. And again, I'm not showing results here, but if you use a Bregman divergence like the entropy you still get sparsity in that case too, though the code runs a bit slower because there are exponentials and logarithms involved. [Question about structured penalties.] So, are you thinking of a block penalty, like blocks of kernels? Actually, yes, that's something we would like to consider. The two things on our plate are, one, to consider an elastic-net regularization, and the other, to consider a block structure. We have some ideas on the block structure but they are not fully worked out yet; I can talk to you in private about them. [Question about whether combining kernels actually helps.] It has been shown that using multiple kernel learning gives a statistically significant boost over just using any individual kernel; I think they used something like a thousand kernels, but you may well know a lot more about this than I do. And my understanding is that some of this is still limited: if you read some of the papers which apply multiple kernel learning to the PASCAL object challenge, for instance, they are rather convoluted, because they say, well, we really cannot solve the full problem, so we jump through three or five different hoops and eventually try to reduce it to a small problem we can handle, and then handle that. So part of the answer is that we don't know yet, because we haven't had the algorithms to support this kind of thing, and now that we have them, this is one of the things we are trying. [Question about the one-dimensional solve.] Even for p equal to two there is no analytic solution once you have more than two variables in the subproblem, and for the other cases you don't have an analytic solution anyway, so you do have to take Newton steps. In practice it turns out that just one, or in some rare cases two, Newton-Raphson steps are needed, so it's really, really fast. And all we need to guarantee convergence is that you make sufficient progress whenever you solve a subproblem, that you decrease by at least something like an Armijo-type condition with appropriate parameters. That's all you need. Other questions? [Question about choosing kernel bandwidths.] Yes, there are many different bandwidths and you need to select the right ones; let's talk about it,
we'll talk about it in person. OK, so the story so far is this: if you look at the kinds of algorithms that have been successful, they have tackled small to mid-size problems. Especially if you have a small number of dimensions, kernels really add a lot of value, because of this nonlinear transformation; but the issue is that when the number of training examples increases, the number of dual variables increases, and so the number of alpha coefficients to optimize becomes really large. Now, it turns out that there are many datasets which keep growing in size, and in some of these applications, NLP for instance, or other such applications, it turns out that kernels don't actually seem to help, because the data is already so high dimensional. If you have a bag-of-words representation of a document, the data is already so high dimensional that putting a kernel on top of it and mapping it into an even higher dimensional space more or less does not buy you anything. So what you want are optimizers which can deal with these large amounts of data and which do not necessarily use the kernel trick; basically I'm talking about solving everything in the primal. In this domain, again, I want to give you a summary of the results before we dive into the technical details. On the webspam trigram dataset, which has about three hundred thousand examples in the training set, fifty thousand examples in the test set, and about sixteen million dimensions, I can train a linear SVM in ten minutes on a single core, and in about eighty seconds running on multiple cores. The same on the alpha dataset, which has about four hundred thousand data points and two thousand dimensions (this is a very dense dataset), and again I can train a linear SVM in about four minutes. And when I say I can train a linear SVM in four minutes, what I mean is that I can get to within about one percent or so of the final accuracy in about that time. I can also optimize for the ROC area, which is a performance measure that you want to use in some cases (I'll show you some examples of loss functions that directly optimize the ROC area), and you can solve that in about seven minutes. And on the KDD Cup dataset, which has about nineteen million data points and about thirty million dimensions, I can train a linear SVM in five minutes and optimize the ROC area in about an hour. And all of these algorithms can also be parallelized, so they can actually run in parallel, though I should say the sort of parallelization we're doing is still reasonably straightforward, and there may be experts here who know a lot more about these kinds of problems and may have ideas on how to make it more efficient. So, given the challenge of having these large datasets and trying to learn classifiers on them, there is a very general framework that you can use, which is the framework of regularized risk minimization. In this framework you have some sort of loss function, which measures how well your predictions are doing on your dataset, and then you have a regularizer, which says that you want simple classifiers, so that you don't overfit your data.
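In symbols, the framework being described is usually written something like this (my notation; the constants on the slides may differ):

\[
\min_{w}\ \ J(w) \;=\; \lambda\,\Omega(w) \;+\; R_{\mathrm{emp}}(w),
\qquad
R_{\mathrm{emp}}(w) \;=\; \frac{1}{m} \sum_{i=1}^{m} \ell(x_i, y_i, w),
\]

with, for example, \(\Omega(w) = \tfrac{1}{2}\|w\|^2\) and \(\ell\) the hinge loss \(\max(0,\, 1 - y_i \langle w, x_i\rangle)\).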
OK, here I'm showing you a particular instance of the empirical risk, which is obtained by taking an average of the loss function over your data points, and this loss function actually turns out to be very closely related to the loss formulation that we saw before. But keep in mind that you can have other kinds of loss functions, and one of the interesting kinds that you may be interested in is not written as an average over your data points at all, but is a loss that depends on your entire dataset; it's not a one-over-m-times-a-summation form. I call such loss functions non-decomposable: the loss is a function of your entire training set and all of your labels at once. For most of this talk I will stick to decomposable losses, but I'll show you a few results on what you can do with non-decomposable losses as well. Now, the main challenge in trying to solve these kinds of problems is that the interesting losses that you can set up more or less all have a point at which they are not differentiable. For instance, the loss that I showed you on the previous slide, the hinge loss, which is what I'm plotting here, is not differentiable at this location. This is the key challenge: you have to solve a non-smooth optimization problem, and of course non-smooth optimization is more expensive. So the objective function is non-smooth, and if you look at the state-of-the-art algorithms, some of which we developed ourselves, you can prove that in certain situations they require at least on the order of one over lambda epsilon iterations. If you think about it, the one-over-epsilon part is already bad behavior, but you can more or less tolerate it; it's the one-over-lambda factor in this bound that is disturbing. Because as data increases, what you want is for the empirical risk to become more and more important, and for the regularizer to shrink in importance. Which means that as you get more and more data, you should tune your lambda down, because you don't want the regularizer to overwhelm your data. But then that means that as you get more and more data, lambda goes down and your bound goes up. That is a problem that almost every algorithm suffers from, including stochastic algorithms and the bundle methods that we developed ourselves; and we also proved lower bounds recently showing that they really do require this one over lambda epsilon number of iterations. The other problem is that some of these algorithms really do not work in a distributed setting; inherently, the way they are designed, they cannot. So if you cannot fit your data on a single machine, there is just no way you can run these algorithms. So the question is: can we do better? The idea of what we are trying to do is basically inspired by a series of papers that Nesterov wrote, which say: take a non-smooth optimization problem and replace the objective with a smooth objective, where of course you want the smooth objective to be reasonably close to the non-smooth objective, and there is a tunable parameter which tells you how close or how far away
this approximation can be. So how do we apply this in the regularized risk minimization setting? What we want to do is do this smoothing in a principled way. What I mean by principled is that it's not a matter of: here is one loss function, let me apply the trick to it, and here is another one, let me cook up a separate smoothing for that one. We want something like a higher-level recipe, so that if you come up with a loss function, or with a problem, you can say: here is the recipe, all the equations and all the math have been worked out for you; just follow it step by step and you will get a smoothing, and that smoothing gives you a smooth objective function that you can minimize using off-the-shelf tools. That is what we want to do. And, very importantly, we also want the same kind of treatment for decomposable losses and for non-decomposable losses. If a loss is decomposable, it turns out there are any number of ways you can handle it; there are a number of algorithms that can deal with it. But if your loss is non-decomposable, there are very few algorithms which can handle these kinds of complicated losses, and we want to handle those kinds of problems too, in the same setting. So how do we do that? We start, say, with our regularized risk minimization problem, and we replace the empirical risk by a function g-star-mu. Whenever I use a star in my slides it denotes the Fenchel dual of a convex function, and mu is a tunable parameter which I'll explain in a minute; what you need to do is figure out the g and the rest of the ingredients. So let me explain what we want. What we want to design is a g-star-mu such that it uniformly approximates the empirical risk. What I mean by a uniform approximation is this: if I draw a mu-tube around my empirical risk (just think of placing a tube of width mu around the risk), then my g-star-mu must fall inside the tube. So you want an approximation which is tight in the l-infinity sense. At the same time, I also want g-star-mu to be smooth, and I actually require a little bit more than plain smoothness, though not quite twice differentiability, a little bit less than that: I need its gradient to be Lipschitz continuous. Now suppose I can find such a g-star-mu; just assume for a minute that I can. Then there is a theorem, essentially due to Nesterov, which says that there exists an optimization algorithm which will converge in this many iterations. Here you see that the dependence on epsilon is one over the square root of lambda epsilon, and if lambda is too small then you fall back to a one-over-epsilon rate. And then there is this square root of D times the norm of A factor, where the norm depends on the norms that you use for the spaces of g and u.
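In symbols, the construction being described (anticipating the recipe on the next slide, and with the constants hedged since I am reconstructing them from the narration) looks roughly like this:

\[
R_{\mathrm{emp}}(w) \;=\; g^{\star}(A^{\top} w),
\qquad
g_{\mu} \;:=\; g + \mu\, d,
\qquad
g_{\mu}^{\star}(A^{\top} w) \;=\; \sup_{u}\ \big\{ \langle A^{\top} w,\, u \rangle - g(u) - \mu\, d(u) \big\},
\]
\[
0 \;\le\; g^{\star}(z) - g_{\mu}^{\star}(z) \;\le\; \mu\, D,
\qquad
D \;:=\; \sup_{u \in \operatorname{dom} g} d(u),
\]

and the gradient of \(w \mapsto g_{\mu}^{\star}(A^{\top} w)\) is Lipschitz with constant on the order of \(\|A\|^2/\mu\), so an accelerated, Nesterov-style method on the smoothed objective needs roughly \(O\!\big(\sqrt{D}\,\|A\|\cdot\min\{1/\epsilon,\ 1/\sqrt{\lambda\epsilon}\}\big)\) iterations.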
D here is a constant which depends only on your prox function and is independent of epsilon and lambda. OK, so what's the high-level recipe for constructing this kind of g-star-mu? How can I go ahead and construct a function which has all these desirable properties, a function that uniformly approximates my empirical risk? The way you do it, the high-level recipe, is this. First, you need to devise a convex function g and a matrix A, where the matrix is usually a function of the data that you have, such that the empirical risk can be written as the Fenchel dual of g applied to A-transpose w. The next step is that you need to choose what is called a strongly convex function d. If you know convex analysis, a strongly convex function is one where, if you subtract a quadratic out of the function, it still remains convex; if you don't know that, just keep this picture in mind: think of it as a generalization of a quadratic function, where how quadratic it is is measured by the modulus of strong convexity. And the function must be bounded on the domain that you're considering. Once you have those, you just define your g-mu to be g plus mu times d, and you dualize it, and it turns out you can prove that this function satisfies everything that you need. Now, there are a few pitfalls that you need to be careful about while designing this kind of scheme. Of course there are many choices of g, A, and d that you could come up with, but one thing you need to be very aware of is that when you choose your d, your prox function, you should ensure that this quantity, the square root of D times the norm of A, is independent of the training set size. Because if it depends on the training set size, you basically can't do much: as the number of training points increases, your iteration bound increases, which is absolutely not what you want. The other thing you should be careful about is to ensure that the gradient of g-star-mu is easy to compute. What I mean by easy to compute is that if it takes a certain amount of time to compute a subgradient of the empirical risk, it should take about the same amount of time to compute a gradient of g-star-mu; you don't want to reduce the number of iterations only to have each iteration become more expensive. So those are the two things you need to be careful about. Let me show you the example for the binary hinge loss, and if you're interested in the other extensions, please come and talk to me; we have material on those as well. So how do we implement this recipe? For the binary hinge loss, you start with a function g which is basically linear inside the box, inside the m-dimensional unit box, and is infinity everywhere else, and you choose a matrix A which stacks up all your data points, each one multiplied by its label and scaled appropriately.
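One concrete way to write this down (the exact signs and scaling on the slides may differ, so treat the constants here as an assumption) is:

\[
g(u) \;=\; -\frac{1}{m}\sum_{i=1}^{m} u_i \quad \text{for } u \in [0,1]^m \ \ (+\infty \text{ otherwise}),
\qquad
A \;=\; -\frac{1}{m}\,\big[\, y_1 x_1,\ \dots,\ y_m x_m \,\big],
\]

which gives exactly

\[
g^{\star}(A^{\top} w) \;=\; \frac{1}{m}\sum_{i=1}^{m} \max\big(0,\ 1 - y_i \langle w, x_i\rangle\big).
\]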
With a little bit of math you can convince yourself that the empirical risk really can be written as g-star of A-transpose w. Now for the choice of d: I want to show you two different choices of d that you can make. This d is often called a prox function, and here are two different prox functions you can choose. The first prox function I'm showing is the quadratic norm. One of the nice properties here is that for these kinds of decomposable problems, the approximating function, the function that approximates the empirical risk, is also separable, which means you can write it as a summation of simple one-dimensional functions. This is what the (admittedly ugly) function looks like, but rather than the equations, let's look at a picture, which probably speaks better than hundreds of equations. This is the approximation: by tuning your mu you can make it closer to or further from your hinge loss. It is linear in this region, it is linear (in fact flat) in this region, and between these two regions there is a quadratic interpolation. That is what the function looks like. On the other hand, you can also choose to use an entropy as your prox function. If you use the entropic prox function, then again your g-star-mu is separable, and the function that you get looks like this. Let me relate this function to something more familiar. If you ignore this constant for a minute, this is basically mu times the log of one plus the exponential of a scaled version of one minus the margin. So this is nothing but a logistic-type loss, with a different offset, and there is a sharpening parameter mu that allows you to make the loss as sharp or as loose as you want. And again, if you look at the approximation, you can see that you get an approximation of the hinge loss, and it looks more or less like a logistic loss; it's just a scaled, shifted version of the logistic. Now, the same recipe can be applied to other settings; it just becomes a lot more technical, so I won't bore you with the details. But just to give you an idea: we can use it to optimize multiclass problems; we can use it for the ROC area, where you want to directly optimize and maximize the area under the ROC curve; or, if you're interested, for directly optimizing a classifier which has a good precision-recall break-even point. These are multivariate performance scores, and these are functions which depend on the entire dataset; these two do not decompose over the dataset, but you can still apply the same recipe and you get the same kind of smoothing. So how does this actually work in practice? Here is a dataset, the webspam trigram dataset. It has a few hundred thousand examples but about sixteen million features, and I'm showing you, for a particular value of lambda, how the objective function evolves. For this particular dataset we couldn't get any of the other solvers to even load the data; the dataset is, I think, something like thirty gigabytes, and most of the other solvers couldn't even load it, while we can happily do that. This run is on a single machine with thirty gigabytes of RAM.
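Here is a small numerical sketch of the two smoothed hinge losses just described; it is my own reconstruction from the description above, so the exact offsets and constants are assumptions, and it also checks that each smoothing stays within its mu-tube of the hinge:

```python
# Two smoothings of the hinge loss l(z) = max(0, 1 - z), where z = y * <w, x>.
import numpy as np

def hinge(z):
    return np.maximum(0.0, 1.0 - z)

def smoothed_hinge_quadratic(z, mu):
    """Quadratic prox: zero for z >= 1, quadratic on [1 - mu, 1],
    linear (hinge shifted by mu/2) below that -- a Huber-style hinge."""
    t = 1.0 - z
    return np.where(t <= 0.0, 0.0,
           np.where(t <= mu, t**2 / (2.0 * mu), t - mu / 2.0))

def smoothed_hinge_entropy(z, mu):
    """Entropic prox: a scaled, shifted logistic (softplus) loss that
    sharpens towards the hinge as mu -> 0."""
    t = 1.0 - z
    return mu * np.logaddexp(0.0, t / mu)   # mu * log(1 + exp(t / mu)), stably

z = np.linspace(-3.0, 3.0, 601)
for mu in (1.0, 0.1, 0.01):
    gap_q = np.max(np.abs(smoothed_hinge_quadratic(z, mu) - hinge(z)))
    gap_e = np.max(np.abs(smoothed_hinge_entropy(z, mu) - hinge(z)))
    print(f"mu={mu}: uniform gap quadratic <= {gap_q:.4f}, entropy <= {gap_e:.4f}")
```

The printed gaps come out at roughly mu/2 and mu times log 2 respectively, which is the uniform mu-tube approximation property the recipe promises.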
I'll also show you multicore, multiprocessor runs, where you don't have that much memory available per process. Here you can see that in about five hundred seconds or so we are very, very close to the final test accuracy. On this particular dataset it actually does make a difference: there is only a very minor improvement, but there is an improvement when you set a small value of lambda. And you have to remember one thing: these are all relatively clean datasets, which means that you can do very, very well on them, and the good algorithms that we compare against do extremely well on clean datasets. Now, when we run it on eight processors, you can see we don't get an eight-times improvement, because of the communication cost, but in about two hundred seconds or so we're pretty much done, and this is how the objective function evolves in that case. The upshot is that you can distribute this across multiple processors. This is something that is very important, because whenever I talk about large-scale datasets, people often ask me: why don't you just run stochastic gradient descent on this? But stochastic gradient descent is not easy to parallelize; I cannot run it in parallel, whereas we can run this in parallel. Here is another dataset, the alpha dataset. This was one of the challenge datasets in the Pascal 2008 large-scale learning challenge; it has about four hundred thousand examples, and each example has two thousand dimensions, a totally dense dataset. I'm showing you the evolution of the objective function for our optimizer and for liblinear. Liblinear is a very impressive piece of software, and it's a very impressive algorithm. Actually, the underlying idea of liblinear was published in two different places: we published it in two thousand and seven, and Chih-Jen Lin discovered, or rediscovered, it completely independently, and of course did a much better job of implementing it and making the software solid. What happens is that when you do coordinate descent in the dual, it is the same as doing stochastic gradient descent in the primal, because you're selecting one variable at a time. Now, in stochastic gradient descent one of the major issues you have to tackle is how to tune the step size. If you use something like Pegasos, for instance, it does not converge very well on most of the practical problems we tried, because tuning the step size is a huge problem. But you can get rid of the step-size tuning by doing something called an implicit update, and an implicit update in the primal is exactly like coordinate descent in the dual: if you maximize the dual progress at every iteration, it is exactly like doing an implicit update in the primal. We published this in the primal, stochastic gradient view, and in one of the appendices we noted that it is also equivalent to maximizing dual progress; they have the better software, I must admit. Later, at ICML, when Chih-Jen Lin talked about his work, we realized that the two are pretty much connected. And again, because liblinear is doing dual coordinate descent, it inherently cannot be parallelized; there is just no way to do it.
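For concreteness, here is the distinction between the explicit and the implicit update in a standard, schematic form; this is my own summary of the idea, not necessarily the exact update from the slides:

\[
\text{explicit:}\quad w_{t+1} \;=\; w_t \;-\; \eta_t\, \nabla_w \big[\lambda\,\Omega(w_t) + \ell(x_t, y_t, w_t)\big],
\qquad
\text{implicit:}\quad w_{t+1} \;=\; \operatorname*{argmin}_{w}\ \Big\{ \lambda\,\Omega(w) + \ell(x_t, y_t, w) + \tfrac{1}{2\eta_t}\|w - w_t\|^2 \Big\},
\]

where the explicit step uses a (sub)gradient on the current example. For the hinge loss the implicit step has a closed form, and the point being made in the talk is that maximizing dual progress one example at a time amounts to this kind of implicit primal update, with no step size to tune.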
So here again, this is on the alpha dataset. One thing that may seem strange is: why is the objective function for liblinear jumping up and down? The reason is that it is a dual solver; it does not optimize the primal objective function, so there is no guarantee that the primal objective will go down monotonically. We are a primal solver, and we can guarantee that the primal goes down monotonically. And again, in terms of accuracy, at about three hundred seconds or so we're pretty much there, while liblinear at about two thousand seconds is still struggling to get there. Here's another dataset, the KDD Cup dataset; this one has about nineteen million examples and about thirty million dimensions. Here something very strange happens: it turns out that our algorithm takes a long time to bring the objective function down, and liblinear is actually very fast at bringing the objective function down; but if you look at the generalization performance, the story is completely different. After one iteration, which takes about two hundred and fifty seconds or so, we have reached pretty much the final test accuracy, while liblinear is still taking time; it's not very far behind, and as I said it's a very impressive piece of software and a very impressive algorithm, but it's still a little bit slower than us. Now, this is an example of a non-decomposable loss, the ROC area; I just want to show you that we can actually handle these non-decomposable losses. Here I am comparing against BMRM, which is a bundle-method solver. Bundle methods are pretty much the only thing you can do right now if you have a non-decomposable, non-smooth loss; they're pretty much the state of the art, and BMRM is actually a solver that we developed ourselves for this kind of problem; if you know SVMstruct, BMRM is more or less equivalent. You can see that because BMRM builds a piecewise-linear lower bound to the objective function, it is not guaranteed to monotonically decrease the objective function value, and it takes a long time to get anywhere reasonable; meanwhile, we basically eat it for breakfast. And here I'm showing you the ROC area on the test set, and you can see that in about four or five passes through the dataset we are pretty much at the final ROC area. This is the same kind of story on the KDD Cup; actually, I have to admit the BMRM runs have not yet finished. These results were printed at eight this morning, and BMRM is probably still running on the cluster. But as you can see, we had already converged while BMRM was still running when I got this result, and within about two thousand to two thousand five hundred seconds we have the test ROC area pretty much at its optimum. And this is on another dataset, for the precision-recall break-even point, on the Covertype dataset, which is a moderately small dataset, about half a million examples with fifty-four dimensions. This is how the objective function evolution looks for BMRM, and this is how we do; the same story again, and
I won't bore you with the details. So, to wrap up, I want to briefly spend a minute talking about the things that I did not talk about in this talk. You can actually use a Bregman divergence, or an entropic regularization, in the MKL setting; come see me for the details. I also briefly mentioned, in answer to one of the questions, the convergence of SMO-MKL: you can prove convergence of the algorithm. What we don't have yet is a rate; it would be really good if we could get rates, and that's something we'd like to look at. And I didn't talk about the extensions of the smoothing techniques to things like the ROC area or the precision-recall break-even point in detail; I just showed you some experiments, because the details get technically messy, but pretty much the same high-level recipe holds. I want to end with a note of caution. Designing specialized optimizers for machine learning problems really is an important area: you really do know a lot about the objective function, and you must spend time understanding your problem, understanding what sort of objective function you are minimizing, before you launch into throwing it at an optimizer. If you do that, you can scale to larger and more interesting problems. But at the same time I also want to say that smoothing is not the answer to every problem. If you have an l1-regularized problem, for instance, the statistical properties of your estimator will be completely lost if you smooth: you will still get an approximate solution, but the statistical properties are completely lost. So you have to be careful; smoothing is not the answer to every machine learning problem, but for the ones it is an answer to, it seems to be an extremely good one. [Question.] Yes, because the statistical properties and the generalization ability of an l1-regularized problem depend on the solution being sparse; once you smooth, you lose the sparsity, which means you don't get the same behavior. The SMO-MKL algorithm appeared at NIPS, and the code is available for download from our homepage. The smoothing work that I talked about has been submitted, and the code is coming really soon; I hope it's not "research really soon," but hopefully it will come in a week or so, once we clean up our scripts and put them up. I'd like to acknowledge my collaborators and co-authors from Microsoft Research India and our students at Purdue University. The work on smoothing was done with a former PhD student of mine who is now a postdoc at the University of Alberta, and with a PhD student at the University of Chicago whose de facto advisor I am. And that's all; if you have any questions, please. Thanks. [Question about the SMO step.] That's exactly why you need to solve for two variables at a time, because of the equality constraint. You freeze all the other variables; you start with a feasible point, and you always maintain the equality constraint, over the entire solution path. And yes, it is very closely related to block coordinate descent; SMO is by and large inspired by block coordinate descent, and we are inspired by SMO. Other questions? Thank you.