Really smart guy, with good taste in coming to universities. One unique thing about him: not only does he prove lots and lots of hard theorems, but he also really wants to implement things, so he's trying to solve these problems holistically, caring about experimental results as well.

OK, so: thanks so much for the invitation to come speak. It's nice to meet you folks here and to see the lab and so on. This talk is going to be mainly theory, but I have a little bit of quasi-experimental material, and the things in at least the first part can be, and have been, implemented; so, believe it or not, it is possible to implement them. I'll talk about sublinear-time optimization for various classical machine learning problems, and throw in a few thoughts, related to ideas in those problems, having to do with nearest neighbor searching. The sublinear optimization work, which is the main part of the talk today, is joint work with Elad Hazan and David Woodruff, who evidently was here yesterday talking about something else vaguely in this same space.

First I'll talk about sublinear optimization, which uses a technique that is very simple but that we dignify with the term "L2 sampling." Then I'll talk a bit about nearest neighbor search, trying to speed it up using L2 sampling; there the improvement is relatively modest, just reducing the provable dimension-dependent factors in the query running time.
That is, we try to do a little better with respect not to the number of points in the nearest neighbor problem but to the dimension the points sit in. But let me start with the setup and notation. Throughout, we have n points in d dimensions; we can consider them as an n-by-d matrix A, whose i-th row is one of our points, which I'll call a_i.

One of the problems I look at is the problem solved by the perceptron. We're given some points colored, say, red and blue, and we want to split the colors with a hyperplane so that the point sets on each side are monochromatic: blue on one side, red on the other. This is a solved problem, solved by the classical perceptron algorithm; it's also, of course, a particular kind of linear programming problem. Another problem I'll mention today is the minimum enclosing ball problem: find the smallest ball containing the set of input points. This is a particularly simple kind of quadratic programming, with a quadratic objective function and linear inequality constraints.

Now, what did we actually do? We produce sublinear-time algorithms, meaning algorithms that avoid looking at all of the data; "sublinear" here is not about parallelism. They are Monte Carlo, so there is some probability that the results are incorrect, although we can make that probability very small. They are approximation algorithms with an additive error epsilon, which is one of the input parameters; the error is additive, scaled by the maximum of the Euclidean norms of the input rows.
For purposes of discussion I'll assume that the input vectors, the rows of the matrix, all lie within the unit Euclidean ball. Under that assumption these various statements follow; in general, we just need to know the scaling factor that makes it true. And we have not only these algorithms but matching lower bounds for them: we show that, within the particular sublinear regime we're looking at, you cannot do better than our results, up to polylogarithmic factors.

For the perceptron problem, separating the red points from the blue points, we can do it in time O((n + d)/epsilon^2). Compare this to the classical perceptron, which takes time proportional to 1/epsilon^2 times the number of nonzero entries of the matrix; so if we have a moderately dense matrix, we win. Here the objective function is the thickness of the slab separating the red points from the blue points, the so-called margin, and our epsilon is the additive amount by which we're off from the best possible margin.

So that's one of our results. For the minimum enclosing ball problem we have a fairly similar running time, except a little better in its dependence on the dimension: it's n/epsilon^2 plus d/epsilon. Now, the state of the art for this problem is somewhat better with respect to running time: there was a paper in the last SODA showing that it's possible to solve it in time proportional to the number of nonzero entries of A divided by the square root of epsilon. So there we need an even denser matrix for our results to pay off.
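For reference, the classical perceptron baseline mentioned above, the one whose cost is proportional to nnz(A) per pass with about 1/epsilon^2 update rounds at margin epsilon, can be sketched as follows (a minimal dense-vector version; the function and variable names are mine):

```python
import numpy as np

def perceptron(A, labels, max_iters=1000):
    """Classical perceptron: find w with sign(<w, a_i>) == labels[i].

    A: n x d array of points (assumed scaled into the unit ball),
    labels: +1 / -1 per row. Each pass costs O(nnz(A)) work; with
    margin eps the number of updates is O(1/eps^2).
    """
    n, d = A.shape
    w = np.zeros(d)
    for _ in range(max_iters):
        updated = False
        for i in range(n):
            if labels[i] * A[i].dot(w) <= 0:   # misclassified (or on the plane)
                w += labels[i] * A[i]          # move the normal toward the point
                updated = True
        if not updated:
            return w                            # every point strictly classified
    return w
```

Nothing sublinear here, of course; every update touches a full row, which is exactly the cost the sublinear algorithm avoids.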
Also, to say it's sublinear is cheating a little: you need to know the lengths of the input vectors, or rather the distances of the input points from the origin. And there's a condition on the lower bound for minimum enclosing ball: it applies to algorithms whose output, the approximate minimum enclosing ball's center, lies in the convex hull of the input. This is not a very strong condition; I don't know of algorithms for this problem that return a center outside the convex hull of the input points. But our lower bound applies only to algorithms with that property.

I won't talk about many of these things, but we can extend these results to slightly more general kinds of quadratic programming, including the L2 support vector machine. We have kernelized versions, where there is an additional additive term in the running time depending only on epsilon. We have streaming versions; we have parallel versions. We beat it pretty much to death. So those are our proven theorems; those are our results.

Now I'll talk about the minimum enclosing ball problem and try to give you an idea of what our algorithm does. One version of the problem is that we're trying to find the center point x that minimizes the radius of the enclosing ball centered at x. So there is a function r(x)^2, the squared radius of the ball centered at x, which is the maximum over all input rows of the squared distance from x to that row. So, just reading off what this says: the problem is to find a center that minimizes the maximum squared distance.
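In symbols, the objective just described is r(x)^2 = max_i ||x - a_i||^2, and a direct evaluation of it might look like this (illustrative only; the function name is mine):

```python
import numpy as np

def meb_objective(A, x):
    """Squared radius of the smallest ball centered at x containing the rows of A:
    r(x)^2 = max_i ||x - a_i||^2."""
    return np.max(np.sum((A - x) ** 2, axis=1))

# The minimum enclosing ball problem asks for the x minimizing meb_objective(A, x).
```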
An equivalent, more elaborate way to put it: instead of taking the maximum over the rows, we take a weighted linear combination of the squared distances, weighted according to some vector p in the unit simplex; that is, p has nonnegative entries that sum to one, so it's a probability distribution. This is exactly equivalent to the problem as stated above, because the p doing the maximization will obviously put all its weight on the row that is farthest from the given x. So we haven't gained anything, or lost anything except possibly some simplicity, but expressing the problem as maximizing over the simplex will be helpful in formulating our algorithms.

The algorithm can be described as a series of rounds, for t going from 1 up to T, with two players in charge of the current x vector and the current p vector: x_t and p_t. The max player tries to find the maximizing p_t in the unit simplex. For a given x, the best response is to throw all the weight onto one particular coordinate of p. That is obviously the best possible thing to do for a given x, but it's not a great response to what the other player might do in the future. A less shortsighted way to maintain the p vector is to multiply its i-th coordinate by a value that is a function of the distance of the current x vector to the i-th row. So if the weight vector p_{t-1} puts some weight on a_1, some on a_2, and so on, and a_1 is really far from the current x_{t-1}, then the new first coordinate p_t(1) is multiplied by a large quantity. It doesn't throw all the weight onto the farthest row; it throws a lot of weight onto the farther rows, so to speak. This is done according to some suitable learning-rate value, which is pretty small, and having done this multiplicative scaling we normalize so that we stay within the unit simplex. For some people this is a very standard thing; it's called the multiplicative weights update of the p vector.

That's what the max player does. The min player tries to find a good center for the minimum enclosing ball: the x that minimizes the weighted linear combination the max player is using. For a given p in the unit simplex, a given set of weights, the best response is the weighted linear combination of the rows with exactly those weights: A^T p. That's the best x for that p. Again, to be less myopic, less shortsighted, we update by taking the current center x_{t-1} and averaging it with this myopic update A^T p_t. This turns out to be an online gradient descent update for x_t. So this formulation is just the interaction of two very standard, bog-standard, techniques for online learning.

So here's the algorithm in a box. We start out with p_1 being uniform weights on all of the points, because we don't know anything about them so far, and x_1 being the best response to that p_1, namely the average of all the points, the centroid. Then the algorithm repeats for T steps: it does the multiplicative weights update of p_t, normalizes by the L1 norm of p_t, and takes a particular convex combination of x_{t-1} and A^T p_t. The upshot is that x_t is the weighted linear combination of the rows you get by averaging all of the p_tau vectors seen so far, for tau going from 1 up to t; x_t is, in a sense, the best response to that average.

So there's the algorithm; nothing particularly magical about it. And it's not too bad: at least it matches the classical perceptron, so that's something. The guarantee is that if you set T, the number of rounds, to about 1/epsilon^2, and take the average x-bar of all the x_t's, then what you get is almost as good as the best possible center: the maximum squared distance for x-bar is within an additive epsilon of the maximum squared distance for the best center you could possibly have. So we've done a lot of shuffling around to get something almost as good as, sorry, not the perceptron algorithm, but very standard algorithms that have been known for a while. The work per iteration is the number of nonzero entries of A, which is at most the number of rows n times the number of columns d.
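The boxed round structure (multiplicative weights for p, averaged best responses for x) can be sketched as follows; the learning rate and the particular averaging scheme here are my illustrative choices, not necessarily the authors' tuned parameters:

```python
import numpy as np

def meb_primal_dual(A, T, eta=0.1):
    """Deterministic max-player / min-player iteration for minimum enclosing ball.

    Max player: multiplicative weights update of p over the unit simplex,
    upweighting rows far from the current center.
    Min player: averaged best responses, so that x_t = A^T (mean of p_1..p_t).
    Returns the averaged iterate x-bar.
    """
    n, d = A.shape
    p = np.full(n, 1.0 / n)                  # p_1: uniform weights
    x = A.T @ p                              # x_1: centroid, best response to p_1
    xs = [x]
    for t in range(2, T + 1):
        sq_dists = np.sum((A - x) ** 2, axis=1)
        p = p * np.exp(eta * sq_dists)       # multiplicative weights: favor far rows
        p = p / p.sum()                      # renormalize onto the simplex
        x = ((t - 1) * x + A.T @ p) / t      # running average of best responses A^T p
        xs.append(x)
    return np.mean(xs, axis=0)               # averaged iterate x-bar
```

With T around 1/epsilon^2 this should get within additive epsilon of the optimal squared radius for unit-ball inputs, per the guarantee above; each iteration costs a full matrix-vector product, which is exactly what the randomized version avoids.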
Where does that come from? We need to compute the weighted linear combination, and we need to compute all the squared distances. The squared distance, by simple high-school algebra, expands so that we need the dot product of every row with the current x_t; that is, we need the matrix-vector product A x_t. In this expression we also compute the squared Euclidean norm of x_t, and we assume we know the squared Euclidean norm of each row. So that's the work done here. The performance is like the classical perceptron, or like the classical algorithms for minimum enclosing ball, but more complicated. However, aside from these two matrix-vector products, the work is O(n + d) per iteration. So, except for the minor detail of computing all of these squared distances and weighted linear combinations, we have something that doesn't take too long.

Having built it up this way, what I want to do now is randomize both what the max player does and what the min player does, so that the total work per iteration is O(n + d). To update the x vector, we use something I'll call L1 sampling, which just means we pick a row with probability equal to its weight p_t(i), and then use that row as the update in the convex-combination update for x_t, instead of using the full weighted linear combination as in the deterministic version. The expected value of that update is, fairly obviously, exactly equal to the weighted linear combination; so in expectation this randomized step, which takes O(d) time, is doing the same thing as the full deterministic one. But that's just in expectation; we don't know anything more yet.

To update the p vector, the weight vector, we use not exact evaluations of all of these distances but estimates of them. Again, expanding the squared distance into things we know: we're willing to compute ||x_t||^2, since that doesn't slow us down, and we assume we know the norms of the rows a_i. So what we need is an estimate of each of the dot products. Here's where the L2 sampling business comes in. We pick a coordinate index j with probability proportional to the squared coordinate of x_t; we might first normalize by ||x_t||^2 so that these are probabilities, but we pick one of the coordinate indices in this way. Then our estimator is the corresponding j-th entry of the i-th row, divided by this coordinate and scaled by the squared norm. This combination of choices gives an unbiased estimator of the dot product: we just selected one number and divided it by another, and in expectation we're good here too. But, here as in the other case, we haven't said anything about the variance, about how concentrated these estimators are.

Another way to put this, which might be helpful: we don't actually make these random choices separately for each row. We pick a single column index j_t
based on the current x_t vector, and use the corresponding column of the matrix A, scaled by the sampling probability, so that we get the right expectation when we're done. There's a nice symmetry here: for the x vector we pick a random row and use it as the update; for the p vector we pick a random column and use a scaled version of it to build the update. We're just swapping back and forth between rows and columns of the input matrix A. You look troubled. No? OK. Yeah, it's primal-dual, which, I guess, a lot of optimization algorithms are in this ballpark of: swapping back and forth between the primal formulation and the dual formulation. I'm not sure I can speak to the particular case you're referring to; I'm not claiming any great novelty in the overall scheme here, so it could well be.

I should say: we have this p vector with n entries, and we don't only do the multiplicative weights scaling. We also build a little data structure so that we can randomly sample one of the rows, that is, pick a coordinate with probability proportional to the current weight. One way to do it would involve prefix sums and then binary search, so there's a log n overhead; I'm ignoring log factors in this discussion, significant as they may be for some purposes. So the cost there is O(n), and similarly for the x vector: you need to preprocess it so that you can do these coordinate samples. Actually, in this setting, since you only use each x vector once, you still have to walk over the vector and do something with the values to build the probability distribution, but it's O(n) for the p vector and O(d) for the x vector, so that's included, within log factors, in the accounting.

So I'm using these very crude estimates for dot products, or for matrix-vector products, in both cases. But the thing is that the randomized version behaves very similarly to the deterministic version, in the sense that you get pretty much the same guarantee on the output for about the same number of iterations, despite using these cheap estimators and doing this fast work in each iteration. So certainly for dense matrices, with lots of nonzero entries, we win. And I should say, for the minimum enclosing ball problem, with a few other tricks we sharpen this to the time bound I stated earlier, but I'm not going to worry about that at the moment.

One question you might have here: this is all big-O, big-O-tilde even, so it could be covering up a lot of sins, and maybe in practice there isn't this good relationship between the randomized version and the deterministic algorithm. So I have some JavaScript, and also some graphs, to show, not that the randomized algorithm is state of the art for this problem, but at least that we're not hiding too many sins in going from the deterministic to the randomized version in terms of the number of iterations. So here are the two algorithms running.
The black dot with the red ring is the center from the deterministic algorithm; the plain black dot is the center from the randomized algorithm. You can see that after not too many iterations the randomized algorithm gets close to the deterministic one. I don't know if it renders as round for you; it's kind of oval for me, but it's supposed to be a minimum enclosing circle. The colors show the p vector: everything starts out the same uniform yellowish, and as time goes on, the input points that are far away get red, red as in hot, meaning they are likely to be chosen; the ones close to the center become blue, because they're less likely to be chosen. So it doesn't take too many iterations. In this particular example the centroid of the input data is itself a pretty good center, so the deterministic algorithm, which starts at the centroid, doesn't have much to do; that's why it behaves that way.

However, let's see if I have this here. Yes: here is some input data where the centroid is not immediately the best center for the minimum enclosing ball, or very close to it. Here again we see that the randomized black dot doesn't take long before it matches up with the deterministic red ring. So here, even though the center of the circle is not near the centroid, the randomized algorithm still behaves in about the same way.

Questions about that? OK. Now, one thing here is that I'm cheating a little in that this is two dimensions, and these algorithms are really only interesting in very large dimension; estimating dot products for vectors with two coordinates is not very exciting.
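The two sampling primitives driving the randomized version, L1 sampling of a row with probability p_t(i), and the single-coordinate L2-sampling estimator for a dot product, can be sketched as follows (function names are mine; this is an illustration, not the paper's implementation):

```python
import numpy as np

def l1_sample_row(A, p, rng=None):
    """Unbiased estimate of A^T p: pick row i with probability p_i, return a_i."""
    rng = np.random.default_rng() if rng is None else rng
    i = rng.choice(len(p), p=p)
    return A[i]

def l2_sample_dot(a, x, num_samples=1, rng=None):
    """Unbiased estimate of <a, x> by L2 sampling.

    Pick coordinate j with probability x_j^2 / ||x||^2 and return
    a_j * ||x||^2 / x_j; averaging num_samples of these reduces variance
    (roughly 1/eps^2 samples for additive error eps on unit vectors).
    """
    rng = np.random.default_rng() if rng is None else rng
    sq_norm = np.dot(x, x)
    probs = x ** 2 / sq_norm                 # the L2 sampling distribution
    js = rng.choice(len(x), size=num_samples, p=probs)
    estimates = a[js] * sq_norm / x[js]      # unbiased single-coordinate estimates
    return estimates.mean()
```

Unbiasedness is a one-line check: E[a_J * ||x||^2 / x_J] = sum_j (x_j^2 / ||x||^2) * (a_j * ||x||^2 / x_j) = sum_j a_j * x_j.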
So what I should perhaps have mentioned earlier: to get a good idea of the relative behaviors, when I did these random experiments to estimate dot products for the randomized algorithm, I threw in noise to simulate the behavior of the higher-dimensional setting, noise with the variance that I know the estimator will have. So this is all kind of cheating. However, I do have some other comparisons of the randomized versus the deterministic algorithm for various public datasets that I got at a site called MLcomp. What's shown here is the ratio of the objective function values for the randomized versus the deterministic, for T equal to twenty-five trials, fifty trials, and so on. Now, this is a little suspect, because I'm not quite sure why the randomized algorithm actually has a better objective function value early on; perhaps it's because it ignores the outliers that I threw in to make sure the deterministic algorithm always has something to do. But anyway, they behave fairly similarly, and this is with, quote, real data, unquote, and it's fairly high-dimensional; it's not cheating in the way the JavaScript version was. Still to do here is comparison with the real competition for this problem.

This work is related to so-called coresets for minimum enclosing ball, where a coreset is a small subset of the input points that specifies an approximate solution. A coreset for the minimum enclosing ball problem would have about 1/epsilon points, with relative error about 1 + epsilon: for an epsilon-coreset, you take its minimum enclosing ball, blow that ball up by a factor of 1 + epsilon, and you contain all of the original input data. These have seen a lot of applications since they were introduced a while ago; they're related to the classical Frank-Wolfe algorithm; they're interesting objects for various reasons. It turns out that what we're doing in our algorithm is, in an elaborate way, yet another proof that coresets exist: the rows that our algorithm picks, put together, form a coreset in the sense I just described. So the small subset of the rows, together with the small subset of the columns, that our algorithm chooses specifies the solution that we ultimately have. And this is true for the classification results as well as minimum enclosing ball. This is somewhere in the space of what low-rank approximation of matrices does: there, a small subset of the rows and a small subset of the columns, put together and multiplied, give a good approximation to the input; here we have something that, in a similar fashion, gives an approximation for a very different optimization problem.

Now, our perceptron algorithm, as opposed to the minimum enclosing ball algorithm I just described in detail, can be described as finding the minimum over p in the unit simplex of the maximum over x in the unit Euclidean ball of the bilinear form p^T A x. That's another way to describe the separation of blue and red points: we're trying to find approximate solutions to this min-max problem. Now, there is previous work that is primal-dual and also uses similar kinds of randomization, except that it looks at the same bilinear form but with both p and x
being in the unit simplex instead: simplex and simplex, rather than simplex and ball. So, as opposed to my vague statement about two players, that setting is actually producing strategies for a game, and the results for that problem are better: they get a small relative error, as opposed to our small additive error. I would argue that it's an easier problem, roughly speaking, because L1 sampling is easier, has better properties, than L2 sampling. Going in the other direction among spaces you might optimize over: it doesn't quite work to look at the min-max version, but if you maximize over p in the unit Euclidean ball and over x in the unit Euclidean ball of that same bilinear form, then what you're talking about is the spectral norm of the matrix. So it would be interesting to try to come up with a sublinear-time approximation algorithm for the spectral norm of a matrix.

OK, I'm not doing too badly on time, I guess. So, the first thing we thought of to try to do something faster for these problems was sketching, or, more classically, the Johnson-Lindenstrauss projection, also called simply random projections. What we would do is map each input row a_i to some vector ã_i in an m-dimensional space, where m is about 1/epsilon^2 times a log factor. It's been known for some time that this can be done so that we have epsilon additive error for estimating the norms of the a_i by using the norms of the ã_i; we can also arrange to estimate the dot products of two rows by using the dot products of the projected versions. Books have been written about this.
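A Gaussian random-projection sketch of the kind just described can be written in a few lines; the constant 8 in the target dimension is my illustrative choice, not a sharp bound:

```python
import numpy as np

def jl_sketch(A, eps, rng=None):
    """Johnson-Lindenstrauss sketch: map each d-dimensional row of A to
    m ~ O(log(n) / eps^2) dimensions so that, with high probability, norms and
    pairwise dot products of unit rows are preserved up to additive O(eps)."""
    rng = np.random.default_rng() if rng is None else rng
    n, d = A.shape
    m = int(np.ceil(8 * np.log(max(n, 2)) / eps ** 2))  # illustrative constant
    S = rng.standard_normal((d, m)) / np.sqrt(m)        # scaled Gaussian matrix
    return A @ S                                        # n x m sketched rows
```

Note that the output A @ S is dense even when A is sparse, which is exactly the drawback discussed next.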
There are many treatments of this, but anyway, that's a one-minute introduction. So the notion would be: do this random projection, solve the problem in the projected space, and map the result back. This is a workable scheme; it would work fine. But it's actually slower than our approach: you do more work just computing the sketches, at least in the standard ways of computing them, than our algorithms do without bothering with sketches at all. Moreover, you'd actually have to store the sketches, which is not necessarily a lot of space but might not be something you want to do. On the other hand, if you have a firehose of data going by, this matrix arriving row by row, you get an input row, do something with it, throw it away, get another one, and so on; then maybe you do want to store sketches and throw away all your input data. Then you could use this kind of approach, but running our algorithm on those sketches would still be faster than previous algorithms on those sketches, because a sketched matrix is dense; it has essentially no zero entries. So we still get an improvement even in the setting where you sketch and then throw away the original data.

So I like this L2 sampling business: when you have a Euclidean unit vector, you pick a coordinate with probability proportional to the square of the coordinate value in order to estimate dot products. It's so ridiculously simple, and it's actually better for our application than the advanced, beautiful technique of random projections. So at least the question arises: are there other things you could do with it? One thing you can do with it is estimate matrix-matrix products, just as we were estimating a number of matrix-vector products, in the obvious way. Now, that's a nice insight, but it's been known for at least ten years that this is a nice way to roughly estimate matrix-matrix products, so we're a little behind the curve on that score. Another thing you might do with it is feature selection. You run your perceptron algorithm and get an output x, a normal vector to the separating hyperplane; now you want to use that vector to classify new input points. One thing you could do is L2 sampling enough times: take an L2-sampled coordinate, take another, and if you do that about 1/epsilon^2 times, then with those coordinates you can produce a dot-product estimate comparable in quality, roughly speaking, to what you get with a random projection. That is to say, for the dot product you would do about 1/epsilon^2 work, and you wouldn't even look at all the coordinates of the input point you're trying to classify. So one way to say it is that this is a very crude sort of feature selection for classifiers: feature selection where you don't really do very much work in order to do it.

Another thing, which I really hoped I would get to, and OK, I will get to nearest neighbor search, is a way of speeding up some aspects of nearest neighbor search in the setting of Euclidean spaces. Here the problem is that we want to build a data structure for nearest neighbor searching on the rows of A
So that given a query point Q. we can find the closest row to Q. in Euclidean norm and we can do that quickly. And I may call in the in this near in the nearest neighbor setting. Sometimes the people call these these input points sites which I will probably probably do on occasion. So there's very many ways that people have. Have tried to do to try to build data structures for nearest neighbor searching one is this random projection method which preserves norms preserves dot products preserves distances and so on. I would say L.S.H. for in the Euclidian cases is as a sophisticated form of Rana projections maybe even exactly Well it's a sophisticated form of doing these these random projections where again we push it. We're projecting down to a small some small dimension and then just working in that using your favorite small dimensional data structure in that in that case. Another way is to take advantage of the low. Intrinsic dimension of the data whatever whatever notion of intrinsic dimension you might happen to have where the doubling dimension. Spear packing properties. Doubling measures. There's a variety of points that are all on a on a low dimensional manifold. There's a number of different notions of what intrinsic dimension here is and algorithms that take advantage blow intrinsic dimension typically are able to use the the distance function between sites or between sites and these and so on. As a black box that is they just know you just have some function that the user provides that returns the distance between sites five and ten and that's all you know about the the distances. 
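As a concrete illustration of the random projection method just mentioned — my own sketch, not code from the talk — here is a plain Gaussian Johnson–Lindenstrauss projection; the 1/√k scaling is the standard choice that makes norms and distances roughly preserved in expectation:

```python
import numpy as np

def random_projection(X, k, rng=None):
    """Project the rows of X from d dimensions down to k using a
    Gaussian map scaled by 1/sqrt(k). By the Johnson-Lindenstrauss
    lemma, pairwise distances and dot products are approximately
    preserved when k is on the order of log(n) / eps^2."""
    rng = np.random.default_rng(rng)
    d = X.shape[1]
    G = rng.standard_normal((d, k)) / np.sqrt(k)
    return X @ G
```

One can then run any favorite low-dimensional nearest-neighbor structure on the projected rows, which is exactly the scheme whose cost the talk is comparing against.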
Another approach is to use a hierarchy of subdivisions, or of enclosing shapes, to guide the search. Yet another kind of scheme does work proportional to the number of sites but with a smaller number of distance evaluations — somebody can probably tell me what the acronym for that line of work stands for, because I forget — where you precompute a bunch of distances among the sites in order to avoid distance evaluations for a particular query point. Then there are combinations: the sketching approach together with low intrinsic dimension — Indyk and Naor did this for the doubling dimension, and Baraniuk and Wakin showed it was possible for points on a manifold. People have also gone back and forth putting together approach one and approach three, roughly speaking, with various nice results there.

Thinking about this L2 sampling, I came up with yet another variation on combinations of these approaches, which is basically: take the intrinsic-dimension methods, whose data structures work in the complete generality of metric spaces using the distance as a black box, but open the black box for Euclidean nearest neighbors and try to do the distance evaluations faster in that case. So we use a general-purpose data structure, but within it we use adaptive-accuracy distance estimates. A disadvantage of these general-purpose data structures is that they don't do that: you always pay the full price for every distance evaluation, and you certainly don't take advantage of the particular structure of a given metric space.

So why do this? Well, if some sites are easily discarded as possible nearest neighbors, then you can do less work for them: you compute their dot products and distances to even lower accuracy than you otherwise would. You can take this approach with either sketches or L2 sampling, and with L2 sampling you don't actually need to store any sketches: you just keep your original raw input data around and use L2 sampling to do distance evaluations on it. This is something of an emulation of the k-d tree methods, the enclosing-shape hierarchies, the metric-ball approaches — not those exactly, but for various such approaches one key advantage is that you get pretty far along in the processing without doing a full-blown distance evaluation: you look at this coordinate, then that coordinate, searching down in your data structure one coordinate at a time, and make a lot of headway without a full distance evaluation. I'm trying to see how far that kind of notion can be exploited in this particular setting.

Another reason to do this is to mention an ancient result of mine about exact nearest neighbor search in metric spaces, which I analyzed for bounded doubling dimension and for random sites and queries. The basic model is that the sites and the queries come i.i.d. from some distribution you don't know; you only know that they are independent and identically distributed. There are many nicer subsequent algorithms, but almost all of them, to my knowledge, are approximation algorithms — they don't solve the exact problem. So whether or not this is useful to you, at least it's a novelty for this data structure.

I think we're going to have to figure out how to wind up here without going too crazy. The data structure — I'm not saying how it's produced — is simply, for each site j, a list of sites. The query procedure maintains a current candidate nearest-neighbor site c, initially just the first site among all the sites. For that candidate, you walk down its list trying to find a site a that is closer to the query than the current candidate. If some site a on the list is in fact closer, you make a the current candidate and continue; if there's no such site — you run off the end of the list for this candidate — then you just return that candidate, and you're done. At this level of description the data structure is fairly trivial. The basic idea — skipping over two slides, which isn't too painful — is simply that the total number of list entries you look at is larger than the total number of distinct candidates you have, and approximate distance evaluations can be used effectively for the list entries. So most of the distance evaluations done by the query procedure can be sped up in terms of their dependence on the dimension. Either sketches or sampling could be used for this, and L2 sampling might work better when the data is sparse.

So I'd better wind up. Let's see — I mentioned that the bounds are tight up to, probably, logarithmic factors.
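The query procedure just described can be sketched in a few lines of Python. This is my own rendering: the talk does not say how the lists are built, so `lists[j]` here is hypothetical precomputed data, and whether the returned site is the exact nearest neighbor depends entirely on that unstated construction.

```python
import numpy as np

def list_query(q, sites, lists, dist=None):
    """Walk the per-site lists as described in the talk.

    sites:  array of points; lists[j] is a precomputed list of site
            indices associated with site j (construction not shown).
    Start with candidate c = site 0; walk down c's list, and whenever
    a listed site a is closer to q than c, make a the new candidate
    and continue from a's list. When a candidate's list is exhausted
    with no improvement, return that candidate.
    """
    if dist is None:
        dist = lambda x, y: float(np.linalg.norm(x - y))
    c = 0
    d_c = dist(q, sites[c])
    improved = True
    while improved:
        improved = False
        for a in lists[c]:
            d_a = dist(q, sites[a])
            if d_a < d_c:
                c, d_c = a, d_a   # a becomes the current candidate
                improved = True
                break
    return c
```

The `dist` argument is the "black box" of the general-purpose setting; the point of the talk is that, for Euclidean data, it can be replaced by a cheaper approximate evaluation for most list entries.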
I may have mentioned that we also get an improvement in the bounds when we look at the running time of these algorithms — for perceptron, for minimum enclosing ball — not from the point of view of how the running time depends on the number of input points and the error ε, but as a function of, say, the dimension and the risk of the classifier we end up with; we improved bounds in that style of analysis as well. These are of course stochastic gradient algorithms, in both directions, though not uniform-sampling stochastic gradient algorithms. What we would really like, I think, is an algorithm that isn't this nice sublinearity business but is simply roughly linear in the number of nonzero entries of the input matrix. And I think I will just thank you for your attention, now that I've run out of time.

[In response to a question] Yes — I'd preprocess the query. Given a candidate and a query vector, I can preprocess so that, as I walk down the list for that candidate, I can do approximate distance evaluations. For each candidate I actually have to touch its coordinates and the coordinates of the query; but for the entries in the list I don't have to, and there are more list entries than there are candidates. It would really be much nicer if I could somehow look just once at the query and then not look at any candidate's coordinates either, but I don't know how to do that at the moment. That's the general idea: the number of list entries is large, and the number of points on a list that are nearly as close as the current candidate is provably small, so for most list entries we know they'll be far enough away that we can quickly discard them.
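The preprocess-the-query idea can be made concrete roughly as follows. This is my own hedged reconstruction, not the paper's exact estimator: precompute each site's squared norm and the query's L2 sampling distribution once, and then each approximate distance evaluation for a list entry reads only a few sampled coordinates of that site, via the identity ||q − a||² = ||q||² + ||a||² − 2⟨q, a⟩.

```python
import numpy as np

class L2DistanceEstimator:
    """Approximate squared distances ||q - a||^2 without reading all
    of a's coordinates: the query's squared norm and its L2 sampling
    distribution are computed once, the sites' squared norms are
    assumed precomputed, and <q, a> is estimated by sampling index i
    with probability q_i^2 / ||q||^2."""

    def __init__(self, q, num_samples=32, rng=None):
        self.q = np.asarray(q, dtype=float)
        self.q_norm_sq = float(self.q @ self.q)
        self.probs = self.q ** 2 / self.q_norm_sq
        self.num_samples = num_samples
        self.rng = np.random.default_rng(rng)

    def estimate_sq_dist(self, a, a_norm_sq):
        idx = self.rng.choice(len(self.q), size=self.num_samples,
                              p=self.probs)
        # unbiased: E[a_i * ||q||^2 / q_i] = sum_i q_i a_i = <q, a>
        dot_est = np.mean(a[idx] * self.q_norm_sq / self.q[idx])
        return self.q_norm_sq + a_norm_sq - 2.0 * dot_est
```

A list entry whose estimated squared distance is comfortably above the current candidate's distance can be discarded after touching only `num_samples` of its coordinates; a site that survives this test would still get a full-accuracy evaluation.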
[In response to a question] Right — for the perceptron algorithm, the lower bound is more or less what you'd call information-theoretic; there are no complexity assumptions. It's basically a needle in a haystack: it's not quite this, but you could have a collection of points whose coordinates are almost all zero except for one coordinate that determines everything, and you pretty much have to look at all of the input coordinates to figure out what's going on. This is in a setting where the dimension is about 1/ε², so it matches the upper bound: you pretty much have to read the whole matrix when the dimension is about 1/ε².

[In response to a question] I sort of alluded to QP over the simplex — there it is — which is to say you have a quadratic objective function and you're optimizing it over a simplex: a quadratic objective in p and x, where x is in the unit Euclidean ball and p is in the unit simplex. It's general enough that I know L2-SVM is included — we can do L2-SVM — but I don't know how to do L1-SVM, that is, to really encourage sparsity of the normal vector; we don't have an algorithm that quite does that. We can encourage sparsity in the sense of minimizing the norm of the normal vector, but the Euclidean norm rather than the L1 norm, which would be the better thing to optimize. It would also be nice to have something for problems where, as with that one, there's a one-norm in the objective function — to extend this to cases where the objective function is not smooth, which we basically can't handle at the moment.

[In response to a question] I should know more about it myself, but: in order to control the step size, you add a proximal term to the objective function, so that the minimizer of the real objective together with the proximal term takes you only so far away from your current iterate. That's a smooth thing; but then, as part of the step, you linearize the smooth part of the objective, and you're left with the non-smooth part together with basically a squared Euclidean distance. For that special case — non-smooth plus squared Euclidean distance — it's possible to solve the subproblem quickly; this is the so-called iterative shrinkage-thresholding method, I guess. So it reduces to a non-smooth problem that, in that particular form, is easy to solve.
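For the non-smooth-plus-squared-distance subproblem just mentioned, the one-norm case has a well-known closed-form solution: the shrinkage (soft-thresholding) operator used in iterative shrinkage-thresholding methods such as ISTA. A minimal version, as an illustration of why that subproblem is easy:

```python
import numpy as np

def soft_threshold(v, lam):
    """Closed-form minimizer over x of
        lam * ||x||_1 + 0.5 * ||x - v||^2,
    i.e. the non-smooth (one-norm) part plus a squared Euclidean
    distance; each coordinate is shrunk toward zero by lam and
    clipped at zero."""
    return np.sign(v) * np.maximum(np.abs(v) - lam, 0.0)
```

Because the objective separates across coordinates, the whole subproblem is solved in linear time, which is what makes each proximal step cheap.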