[00:00:07] >> ... Thanks, and it's a pleasure to be here; it's been a great morning so far in Atlanta. This talk is based on joint work with Jerry and Yihe, which will appear in this year's NeurIPS in about a week. The high-level area for the talk is robust statistics. So let's first review some super basic non-robust statistics: what is statistical estimation? [00:00:58] It's a broad range of problems where you get to see some independent samples from some probability distribution P, parameterized by some parameter theta, and your goal is to learn something about theta. In this framework you could think of estimating the mean of a distribution, or its covariance, or solving a regression problem, or finding a principal component: [00:01:17] almost any of the day-to-day statistics tasks you can think of. And in almost any of these problems, instead of this model, you can think of a robustified model, where instead of seeing a set of i.i.d. samples, first the samples are drawn and then they're handed to a malicious adversary who can modify any epsilon fraction of them that she likes. Think of epsilon as a small number, either a small constant or going to 0 at some rate. [00:01:44] And then your goal is to again perform whatever estimation task you had originally, but now with corrupted samples. This setup models all kinds of interesting things. It could model the obvious thing, which is malicious data poisoning. You could think of it as modeling some kind of random corruptions in your data, in a very conservative way. You could also think of it as capturing the situation where you were trying to fit a model to your data from some class of models that didn't really contain the true model, but instead just contained some model that was epsilon-close to the true model in total variation distance; so it captures model misspecification. [00:02:20] So a high-level goal is: what statistics problems, especially high-dimensional statistics problems, can you still solve with reasonable provable guarantees from corrupted samples? A very brief and opinionated history: this has been studied for decades in great detail in the statistics community, going back to the 1960s, and for the purposes of this talk anyway, the theme in robust statistics between the 1960s and the 2000s was that many things were well understood from an information-theoretic perspective, and they were well understood in one dimension, but the algorithms that were developed by statisticians [00:03:04] in this timeframe typically had running times that were exponential in the dimension, and that is bad news if you want to do something with a high-dimensional dataset. Conversely, the polynomial-time algorithms that we knew typically had badly suboptimal statistical error: they were just bad estimators of the mean and covariance and regression error or whatever. [00:03:27] And then in the 2010s, especially in a couple of breakthrough papers in 2016, one of them from Georgia Tech, we saw basically the first polynomial-time algorithms for high-dimensional robust statistics that achieved nearly or actually information-theoretically optimal statistical errors. Ok, so we're going to review at least one of these breakthroughs in a moment.
[00:03:51] So, following these breakthroughs in 2016, there has been just a total flurry of papers getting efficient polynomial-time algorithms for a whole bunch of high-dimensional robust statistics problems. But we're not quite to practicality yet: these algorithms are polynomial time, but usually they're at least quadratic time in the dimension, and there are sort of good reasons for that. If you've really worked with a high-dimensional dataset, you know that at quadratic in dimension you're already doomed; you need nearly linear time algorithms. So the goal for this work is to leverage some of the insights from this recent flurry of poly-time algorithms to get something practical, to get nearly linear time, something we can actually run. [00:04:35] The problem we're going to study is robust mean estimation, which is like the simplest high-dimensional robust statistics problem. The setup: you get epsilon-corrupted samples from some distribution D on R^d. You always need some assumption on the underlying distribution to solve these problems, otherwise you can't distinguish between the original samples and the epsilon-corrupted ones. [00:05:00] The assumption we're going to use is pretty standard: it's just that the covariance of the distribution is bounded. And the goal is to estimate the mean of the underlying distribution in ℓ2 error. Ok, simple problem. So you can imagine your samples look something like this: in blue you have good samples, and in red you have bad samples that the adversary added, and your goal is just to estimate the mean, which in this case is at the origin. [00:05:24] Just to get ourselves acquainted with the problem and what kind of errors you could expect: a folklore fact is that if you have enough samples (it turns out d over epsilon samples suffice), you can estimate the mean by some estimator mu-hat which incurs this much error: the ℓ2 error with high probability will be order of square root of epsilon. Don't worry about the fact that it's a square root, that's not so important; what's important is that this is a function only of epsilon, and the dimension doesn't factor into the error rate. This folklore estimator, though, would require some kind of brute-force search in high dimensions, which is not efficient; we'll sort of see what the brute-force search would be in a minute. [00:06:06] The contrast here is that until 2016, or maybe 2017 depending exactly which paper you're thinking of, the polynomial-time algorithms that we knew for this problem would have incurred errors growing polynomially with the dimension: order of epsilon times root d would be a typical error rate from these sort of naive polynomial-time algorithms. And then we saw, [00:06:34] in some papers from 2016-2017, including this one and others, polynomial-time algorithms for this problem with basically this guarantee, this order-of-root-epsilon guarantee. But the fastest running time that I'm aware of for these algorithms was quadratic in dimension. And the main contribution of this paper is the first nearly linear time algorithm for robust mean estimation with this dimension-independent error. [00:07:02] Yeah, the norm is ℓ2, yeah. You can ask the question for other norms, but this is just the simplest problem we can imagine, because our focus is on the running times. Ok.
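For reference, here are the two rates just mentioned, written out as a hedged summary with constants omitted:

\[
\|\hat\mu - \mu\|_2 \le O(\sqrt{\epsilon}) \quad\text{(information-theoretic, inefficient, with roughly } n \gtrsim d/\epsilon \text{ samples)},
\qquad
\|\hat\mu - \mu\|_2 \le O(\epsilon\sqrt{d}) \quad\text{(typical rate of naive polynomial-time approaches)}.
\]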
Yes, so I'll state a real theorem in a minute; here, n is at least d over epsilon to get this. Actually, in a minute I'll tell you the theorem with the full dependence on n and d and everything. [00:07:34] Ok, but just to place this in a little bit more detailed context, and I know it's a big table, now I'm going to tell you what the running times were. So the first algorithms for this problem were in these papers from 2016. Actually I'm fudging history a little bit, since sometimes people made slightly different assumptions, but let's not worry too much about it. [00:07:55] There is a convex-programming-based algorithm that runs in polynomial time, but this is actually the row I want to focus on. Actually I don't know if I'm getting their running time, the SDP one, right; this is the optimistic version, Ok. So there was an algorithm, also in 2016-2017, and these algorithms are spiritually similar even though they're not quite the same; I'm going to focus on this algorithm called the spectral filter, which we'll see later, that has a running time quadratic in dimension. Earlier this year some folks actually got rid of the quadratic-in-dimension running time, but they incurred a really bad dependence on the fraction of corrupted samples, one over epsilon to the 6th, and this was by reduction to some kind of semidefinite programming. [00:08:38] This is nice in the sense that it gets rid of the quadratic dependence, but a problem with reduction to semidefinite programming is that even though we have efficient solvers in theory, they're not so efficient in practice. In this paper we introduce an algorithm that we kind of hilariously called the quantum entropy filter; we'll see in a minute it has nothing to do with quantum, but it's NeurIPS, so we had to make a flashy title. [00:09:02] We get rid of the quadratic dependence, we don't incur this epsilon factor, and the algorithm is implementable. Ignoring logs, that's right, we have polylog dependence, but the polynomial part is n times d. And then, simultaneously, another group actually got rid of the one over epsilon to the 6th from the semidefinite programming approach. [00:09:24] Ok, so now the actual theorem. There is an algorithm that takes epsilon-corrupted samples from a bounded-covariance distribution; now I'm not making assumptions on n, and I'm going to tell you the right error rate with respect to n. With probability 0.99 it finds an estimator mu-hat with this ℓ2 error. [00:09:50] The amount that we pay with respect to corruptions is root epsilon; the amount that we pay in just ordinary statistical error is root of d log d over n. And the comparison here is that if I didn't have any corruptions at all, the empirical mean would pay root of d over n, so up to a log factor we're basically competing with the empirical mean. [00:10:13] The running time is n times d times a polylog. And as some bonuses, we get a couple of things. One is that it's been known in this robust statistics literature for a while that if you make stronger assumptions about the distribution D, you can improve the error rate from root epsilon to something like almost epsilon, and our algorithm captures this; [00:10:36] here you need Gaussian distributions with known covariance, but that's been true also in the literature.
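As a hedged reconstruction from the spoken statement (constants and exact polylog factors are approximate), the guarantee is roughly:

\[
\|\hat\mu - \mu\|_2 \;\le\; O\!\Big(\sqrt{\epsilon} \;+\; \sqrt{\tfrac{d\log d}{n}}\Big) \quad\text{with probability } 0.99,
\qquad \text{running time } O\!\big(nd \cdot \mathrm{polylog}\big).
\]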
The other thing is that the algorithm we get is implementable, and we implemented the main subroutine from the algorithm, which is an approach to high-dimensional outlier detection, and at the end of the talk I'm going to show you what I think are some kind of cool experiments that we were able to run on pretty big data. [00:10:58] Yeah, I don't think so. Well, is it going to be able to detect that? That's a good question, because: can you even robustly test whether I gave you a distribution with bounded covariance? It could be that in some cases you can. There are information-theoretic lower bounds that say, in the absence of an assumption like this, you just cannot robustly estimate. Does that answer your question? [00:11:48] You're saying it should say "I failed, and here's why I failed." That's the question. Yeah, Ok, let's revisit it later. I think you should be able to, actually, but what's confusing me is whether maybe I can concoct some funny distribution that doesn't have bounded covariance. Yeah, here's the problem: suppose I give you a distribution that does not have bounded covariance, and the reason for that is some epsilon-probability event. Now the adversary can remove any samples that came from that epsilon-probability event and fool me into thinking the distribution had bounded covariance. So I think this example will basically kill you. On the other hand, if you have a distribution that doesn't have bounded covariance and the reason is some event that has decent probability, then the adversary won't be able to hide that and you'll be able to detect it. [00:12:57] Makes sense? Ok. So, to get a feel for the problem (most of this talk is going to be about designing this algorithm and proving it works): why is finding outliers in high dimensions difficult? Let's imagine the following situation: the distribution is Gaussian with mean mu and, let's say, covariance identity. Now what would be the most naive thing to do? Well, as algorithm designers trying to estimate the mean, we think: Ok, let's just go look at the samples, and if there are some samples that are really far away from the bulk, we get rid of them. So what will this do? A typical sample from a Gaussian with identity covariance has distance like square root of d to the mean, [00:13:50] and this means that anything that's much farther away than root d we'll easily get rid of by some kind of naive outlier removal. And now you might ask: what can the adversary do in this situation? She could add samples at distance root d to the mean, and at least naively I wouldn't be able to see what she had done. Another observation is that if she does that, adding all of her samples at distance root d from the mean, all pointing in the same direction, this would have the effect of pulling the empirical mean to distance about epsilon times root d from the truth. This is the origin of the epsilon-root-d scaling of the error rates of naive polynomial-time algorithms for this problem.
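To make that last point concrete, here is the back-of-the-envelope calculation being described, as a sketch with constants ignored: if the adversary places an epsilon fraction of points at distance about root d from mu along a unit direction v, then

\[
\hat\mu \;\approx\; (1-\epsilon)\,\mu \;+\; \epsilon\,\big(\mu + \sqrt{d}\,v\big)
\quad\Longrightarrow\quad
\|\hat\mu - \mu\|_2 \;\approx\; \epsilon\sqrt{d},
\]

which is exactly the epsilon-root-d scaling of the naive approach.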
[00:14:29] So in order to solve robust mean estimation you have to be able to detect whether the adversary has done this to you: has she pulled the mean in some direction by some polynomially-growing amount by putting all of her bad samples in that direction? How are you going to detect whether this has happened? The key insight from these papers in 2016-2017, in a nutshell, is the following: [00:14:57] if the adversary modified the samples in order to pull the empirical mean too far in some direction, she also had to increase the variance in that direction. So what's the idea here? The blue samples are good and the red samples are bad, and if you projected all of the samples onto this direction, you would see that, because she had to put an epsilon fraction of samples far out, that witnesses increased variance in this direction. In d dimensions we usually think of the covariance matrix, here depicted by the red oval, and what this says is that in the principal components of the covariance matrix, the top eigenvalue, we will see some increase due to the bad samples, if the bad samples pulled the mean far away. So that's the idea. Let me show you that this can be made formal in a pretty nice way. I'm just going to assume, for most of the rest of the talk, that we have enough samples that the empirical mean and covariance of the good samples concentrate; I'm not going to bore you with details on that. [00:15:57] But under that assumption, here is what is these days sometimes called the fundamental lemma of robust mean estimation. It's supposed to capture this idea that if you shift the mean, you have to increase the variance. Suppose you have some epsilon-corrupted samples from a distribution with bounded covariance, and suppose you take a subset of those samples of size almost all of them, say a one-minus-epsilon fraction of them; for starters, just think of taking all the samples. [00:16:27] And suppose that lambda is the top eigenvalue of the empirical covariance of my subset; again, for starters, think of all the samples. Can you read that on the screen? Ok. Then you get the following inequality, which we can read in two directions. Let's read it first going this way: this is the distance between the empirical mean of the set of samples S-prime and the true mean, and what this says is that if this distance is large, meaning the adversary has moved my mean, then I get a lower bound on the max eigenvalue. [00:17:01] That says: move the mean, have to increase the variance. On the other hand, reading it the other way: if I found a set of samples with bounded top eigenvalue, with bounded empirical covariance, say by throwing out some samples, then I would have a guarantee that the mean of my new set of samples is close to the truth. This suggests an algorithm, or at least an approach: find a set of samples with bounded empirical covariance. And in fact this is the folklore brute-force algorithm that I told you about before: just brute-force over all subsets of size one-minus-epsilon times n. At some point you'll find the subset of good samples, and that will have bounded covariance, because my original distribution had bounded covariance, so I get a good estimator of the mean. Questions? Either boring or clear, it's always hard to tell.
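For reference, here is a hedged reconstruction of the inequality on the slide, in the form it usually takes in this literature, with constants suppressed: if S' is any subset containing at least a (1 - epsilon) fraction of the epsilon-corrupted samples and lambda is the top eigenvalue of the empirical covariance of S', then

\[
\big\| \mu(S') - \mu \big\|_2 \;\le\; O\!\big( \sqrt{\epsilon\,\lambda} \;+\; \sqrt{\epsilon} \big).
\]

Read left to right, a large mean shift forces a large lambda; read right to left, any large subset with bounded empirical covariance has a mean close to the truth.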
[00:17:54] The question is: how fast can you find S-prime? That's right. So now we've reduced robust mean estimation to the question of finding a subset of samples with bounded covariance, and that's going to be the theme for the rest of the talk. Ok, cool, so what algorithm does this suggest? Let me describe the first generation of robust mean estimation algorithms, which are polynomial time but a little too slow. [00:18:18] If the maximum eigenvalue of the empirical covariance tells you that there exist outliers, where should I go looking for them? Probably I should look in the direction of the max eigenvector; I should use the information in the principal components. And that's exactly what this algorithm, called the filter, does. [00:18:39] What we need to do is find a good set of samples S-prime, and the idea is as follows. Let's project all our samples onto the principal component. Say v is the principal component, the top eigenvector of the empirical covariance. I take my centered samples, project them onto v (a one-dimensional projection), and think of plotting these numbers as a histogram. The hope is that the bad samples, here in pink, will typically have a larger value than the good samples, here in blue. And indeed you can make this formal, and this is the main thing underlying the analysis of the filter; actually I didn't tell you what the filter is yet, but I will in just a sec. [00:19:22] On average, the bad samples will turn out to have a higher score, a higher projection, than the good ones, and this means that you could just throw out the samples with a large score and iterate, and you would tend to remove bad samples. So this leads to the filter algorithm, which is exactly what I just described (a minimal sketch of this loop appears below): until you have a set of samples with bounded covariance, compute the principal component, compute the projections of all the samples onto the principal component, and remove samples with large score. It doesn't matter so much exactly how you do the removal, so I'm not going to bore you with details on that; you need to do it in some slightly better way than just removing deterministically, because the adversary can mess with that, but if you remove samples in some reasonable randomized way you'll be in good shape. [00:20:06] Ok, so what's the running time of this algorithm? This is basically the algorithm from 2016-2017. Well, first of all, what's the running time of the inner loop? I need to touch all of the samples at least once to compute the max eigenvector, and so that should incur running time like n times d, where n is the number of samples and d is the dimension. [00:20:34] Yes, and why not n times d, even n times d up to polylogs? You know, I can just do power method, I don't need it exactly. Is there a gap, can I assume there's a spectral gap? The noise can screw that up. Ok.
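Here is a minimal sketch of the spectral filter loop just described, for illustration only. The stopping threshold, the fraction removed per round, and the deterministic removal rule are all simplifications I'm assuming for readability; as noted in the talk, the formal guarantee needs a slightly randomized removal rule.

    import numpy as np

    def spectral_filter(X, eps, cov_bound=1.0, max_rounds=None):
        # X is an (n, d) array of possibly corrupted samples.
        # Repeatedly remove samples with large projection onto the top
        # eigenvector of the empirical covariance, until its top eigenvalue
        # is small (heuristic threshold below).
        X = np.asarray(X, dtype=float).copy()
        if max_rounds is None:
            max_rounds = len(X)                    # naively, at most one round per sample
        for _ in range(max_rounds):
            mu = X.mean(axis=0)
            centered = X - mu
            cov = centered.T @ centered / len(X)
            eigvals, eigvecs = np.linalg.eigh(cov)
            lam, v = eigvals[-1], eigvecs[:, -1]   # top eigenvalue / eigenvector
            if lam <= cov_bound * (1 + 10 * eps):  # heuristic "bounded covariance" check
                break
            scores = (centered @ v) ** 2           # squared projection onto the top direction
            keep = scores <= np.quantile(scores, 1 - eps / 2)  # drop the largest scores
            X = X[keep]
        return X.mean(axis=0), X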
Maybe it's not even clear that you could do the inner loop in near-linear time, but even if we were extremely optimistic and found a way to do the inner loop in near-linear time, I'm saying you would still be in trouble, because you're going to need to run this loop a lot of times. How many times do you have to run it? Well, naively you're going to throw out at least one sample per iteration, so that's going to take n iterations; it turns out you can improve that to d iterations, and this gives you a running time bound. I omitted some log factors here, but [00:21:36] my understanding is you can prove n times d squared up to polylogs. Ok, maybe there's some cheating going on in the spectral computations, I'm not totally sure, but even if you're being very optimistic, this is what you would expect to get out of this algorithm. And my claim is that this extra factor of d, coming from needing to iterate this many times, is intrinsic; it's not really avoidable for this algorithm. So let's see why. [00:22:05] Why should you have to run this spectral filtering algorithm order-of-d iterations? Imagine that the adversary introduces outliers in distinct orthogonal directions, maybe 0.01 d orthogonal directions. Now what will happen? In the first iteration, the principal component, if the adversary is clever enough, will line up with just one of those directions. So we'll find one of the directions, we'll do the filtering, we'll get rid of the samples in that direction. But now there's another direction that the principal component will line up with on the second round, and so forth: I throw out the samples in only one of the orthogonal directions at a time, and this costs me order-of-d iterations of my filter algorithm. Ok. [00:22:46] So the conclusion here is that the spectral filter has some kind of inherent quadratic dependence on dimension, and we need to get past this. What we're going to do is design a filtering algorithm with the same kind of framework, removing some samples that look like outliers according to the empirical covariance and iterating, but we're going to try to find outliers in several directions at once; we don't want to work with just one direction at a time. And just to give you a very high-level sense of why this should be possible, or what it would mean: imagine I'm back in this orthogonal-directions setup, but somebody magically told me, say, the subspace spanned by those directions. Well, then I wouldn't assign samples outlier scores based on their projection onto the principal component; I would assign scores based on their projection into the span of all the outlier directions. And then, being optimistic, you would hope that by throwing out the samples with a large score you would get rid of all the bad samples in one shot. Ok, there's no formal evidence here, but this is the kind of thing you would hope for. [00:23:53] So this is what we're going to try to do: implement an algorithm that finds me a large set of outlier directions, score samples based on, not quite the projection into their span, but something pretty similar, and then run this filter-based approach where I just remove samples with large scores. So what do I need to do, what are the hurdles? Yeah, [00:24:20] the question is: what if the adversary puts outliers in every direction?
I think the claim should be, yeah, it might be that if the adversary has actually put outliers in every direction, in some way that shifts the mean, it's possible she had to do it in a way that the outliers become noticeable in ℓ2 norm, since she has so few outliers per direction: with d directions she can put only epsilon over d outliers in each direction, and to noticeably move the mean with epsilon over d outliers she has to put them really far away, and now I think you should be able to see them in ℓ2. So you've identified some subtlety here, which is that it's not like I can just always say "use d over 2 directions" or something like that; you have to be adaptive to what the adversary has done. [00:25:32] And our algorithm will manage to do this: we won't always find a set of some fixed size, it will depend on the spectrum of the empirical covariance. Ok, questions? So what hurdles will we face in doing this? [00:25:48] First of all, we need some way to take, let's again just say the empirical covariance, and find a good set of outlier directions; actually we'll find some weighted set of outlier directions, you should think of it that way. And then, in this filter algorithm, we need a way to track progress: we need a way to show that we beat the d iterations of the filter somewhere, since that was the whole point, to get rid of those d iterations. So we need some way to show that we're making fast progress in removing the outliers. And for both of these we're going to turn to optimization, and in particular to an algorithm called matrix multiplicative weights. Ok. [00:26:27] Here's the agenda for the rest of the talk. First I want to show you some inspiration for this optimization-based approach: I want to show you how to turn robust mean estimation into a nice min-max, two-player, zero-sum game. This is not strictly necessary to understand our algorithm, but it will make clear that it doesn't come out of nowhere. [00:26:53] Then we're going to go back to these two hurdles and see how this matrix multiplicative weights thing is going to address them. This is maybe out of order, because only then are we going to see what matrix multiplicative weights is; then we'll see the algorithm, and then some experiments. Ok, so the next couple of slides: you don't, strictly speaking, need them to understand the algorithm, but they make clear that it doesn't come out of nowhere. Let's take this [00:27:19] problem of finding a good set (you know, we've got a party going on over there, they're having more fun than us). So, I want to take this problem of finding a set of samples with bounded empirical covariance and turn it into some optimization problem. Ok. The first thing I'm going to do is relax the problem of finding a set of samples to finding a fractional set; this is pretty standard, just turning it into basically an LP.
[00:27:52] So I'm going to replace my set with a set of non-negative weights w_1 through w_n; they're between 0 and 1, and their sum is (1 - epsilon) times n. That's like saying "please find me a set of (1 - epsilon) n samples." Now, for any such set of weights you can think of the weighted empirical covariance of the samples, which is exactly what you think it is, and I'm going to ask that it be at most the identity. Ok, so that's step one. Now I claim I've basically got myself an optimization problem: my goal is to minimize, over weights, the maximum eigenvalue of the weighted empirical covariance. And you can always take the max eigenvalue of a matrix and write it as the max, over trace-one PSD matrices, of the inner product of that matrix with the trace-one PSD matrix. [00:28:36] So now I've got a min-max problem, and if you squint your eyes and pretend that this is linear in both w and U, it almost is, right? What is the weighted empirical covariance? It's the sum over samples of w_i x_i x_i-transpose; that looks linear in w. Pretend it is for a second, and then I've got a two-player zero-sum game, convex-concave. [00:28:59] The function is not quite linear, because you have to center the samples according to w. This turns out to be a technical problem but not a conceptual one; let's ignore it and pretend that we have a two-player zero-sum game, basically a convex problem. Ok, so, [00:29:21] sorry, yes it is, yes, on the w side you have, yes, you're right. And you can capture the previous algorithms by writing the natural SDP for this; you again have a centering problem where this is not linear in the w's, but there are some pretty simple tricks to make it into something linear in the w's, and then if you just solve this SDP it will work. [00:29:55] Yeah, but if you throw it at a standard SDP solver it's going to be slow. So in some sense what we did, actually, is we just analyzed correctly the matrix multiplicative weights SDP solver for this SDP; if you want, that can be your point of view. [00:30:15] Ok, so now one nice thing about this, and actually this observation is due to a grad student at Berkeley, Fred Zhang, is that you can reinterpret the filter that I told you about before as exactly being the ordinary multiplicative weights algorithm on these weights w. That's all it is. [00:30:40] And our approach, as I said, is going to be to use the matrix multiplicative weights algorithm, which I'm going to review in a minute, on the density matrices U. Ok, so this is inspiration; it's supposed to tell you why using some tools from optimization doesn't come out of left field. [00:30:56] But let's now return to the main thread and go back to the hurdles I described before: how are you going to find a set of outlier directions, and how are you going to track progress? If you haven't seen matrix multiplicative weights, maybe this is going to be a bit impressionistic as well, but then we will get to some formalities. So recall that our first problem is: I need to figure out what the right set of outlier directions is.
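For reference, the min-max relaxation described above, written out as a hedged reconstruction (the centering of the samples by the weighted mean is exactly the non-linearity the speaker mentions):

\[
\min_{\substack{0 \le w_i \le 1,\; \sum_i w_i = (1-\epsilon)n}} \;\; \max_{\substack{U \succeq 0,\; \operatorname{Tr} U = 1}} \;\; \langle \Sigma_w, U\rangle,
\qquad
\Sigma_w = \frac{1}{\sum_i w_i}\sum_{i} w_i\,(x_i - \mu_w)(x_i - \mu_w)^{\top},
\quad
\mu_w = \frac{\sum_i w_i x_i}{\sum_i w_i}.
\]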
What's going to fall out of matrix multiplicative weights is the following way of finding a set of directions. [00:31:29] We'll take the empirical covariance, and we will just look at the matrix exponential of alpha times the empirical covariance, for some parameter alpha, just some number. And you can always think of matrices like this as follows. First of all, a matrix like this represents an exponentially weighted sum of the eigenvectors of the empirical covariance. So what my set of outlier directions became is the eigenvectors of the empirical covariance, but not just the top one anymore: now I'm weighting each eigenvector according to its eigenvalue's magnitude, in some exponentially decaying fashion. This is going to be my way of choosing a weighted set of outlier directions. And to get some intuition for why this might make sense, you can always think of these matrix exponentials as the solutions to entropy-regularized optimization problems. So what do I mean? [00:32:21] First of all, ignore this term; suppose I was just solving the problem: maximize, over trace-one PSD matrices U, the inner product of U with the covariance. That would just be PCA; the solution is the max eigenvector. That's not getting a lot of directions, that's just getting one direction. But now I can ask: actually, please find me sort of a high-rank PSD matrix. This is called the quantum entropy regularizer, it's trace of U log U or whatever, it's like a max-entropy program: please find me a lot of directions at once, such that those directions, on average, have a large quadratic form with the empirical covariance. Ok, so this is how I'm going to think of finding a set of outlier directions instead of one direction at a time. Questions on this? You haven't seen how this falls out of matrix multiplicative weights yet, but this is what's going to happen. [00:33:12] I mean, it's under the hood in any matrix multiplicative weights algorithm, right, yeah. I don't know; definitely when people thought about matrix multiplicative weights at a higher level, like when they thought about mirror descent, usually you think about it in terms of the properties of the regularizer you're using, so definitely they thought about this. I don't know about a setting in algorithms where people thought about it quite in this way, but it's not like it's terribly novel to try to find a max-entropy solution to something; this is not surprising once you have seen these regularizers before. [00:33:56] Yeah, I don't want to claim this is any kind of big contribution; this is all just an observation about what happens when you use matrix multiplicative weights. Ok, so the other hurdle is that we need to track our progress, and matrix multiplicative weights comes with this black-box regret framework that lets you track the progress of a multiplicative weights algorithm; that's what we're going to use. [00:34:17] Ok, so I want to spend maybe 7 or 8 minutes telling you the algorithm itself and then 7 or 8 minutes telling you about experiments. So let's do a quick primer on matrix multiplicative weights, a quick refresher; I understand we can't do all the details here, so if you haven't seen this before it's going to be a little rough, but bear with me.
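For reference, the entropy-regularized problem alluded to a moment ago can be written roughly as follows, where S is the von Neumann (quantum) entropy; this is one standard form, stated here as a hedged reconstruction:

\[
\max_{U \succeq 0,\; \operatorname{Tr} U = 1} \;\; \langle \Sigma, U\rangle \;+\; \tfrac{1}{\alpha}\, S(U),
\qquad S(U) = -\operatorname{Tr}\big(U \log U\big),
\]

whose maximizer is the normalized matrix exponential \(U^\star = \exp(\alpha\Sigma)/\operatorname{Tr}\exp(\alpha\Sigma)\). As alpha grows this degenerates to PCA (all weight on the top eigenvector), while smaller alpha spreads weight over many directions.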
[00:34:40] So, this is an aside, a new setting: a density matrix is just a trace-one PSD matrix; that's just terminology. Let me describe a game between two players. There will be a density matrix player, and at each iteration of the game the density matrix player plays a density matrix. Then the other player, called the reward player, plays some reward matrix, unrestricted, just some d-by-d matrix, and the density player accumulates this amount of reward: the inner product of the density matrix and the reward matrix. Note that she had to commit to the density matrix before she saw the reward matrix. Ok. [00:35:24] Now, the goal of the density player is to accumulate as much reward as possible. How do we measure how much reward was possible? We ask: what is the most reward you could have achieved if you had to pick a single fixed density matrix for all of the steps, but you got to know all the reward matrices? This is called the best action in hindsight, and we call the difference between the best action in hindsight and how much reward she actually got the regret. This comes from a gigantic literature on online convex optimization and other things: it's the max, over density matrices, of the amount of reward gotten by that fixed density matrix, minus what the density player actually accumulated. And the main [00:36:14] insight of matrix multiplicative weights, in this very abstract setting, is that it is possible to choose density matrices in a way that lets you control how much regret you accumulate; you don't accumulate very much regret. So here's a fact: if the density player chooses these matrices, which should now look familiar to you, namely the matrix exponential of some number times the sum of the reward matrices seen so far, then she will experience bounded regret. Namely, the regret is bounded by, don't worry about exactly what this is, some norm of the reward matrices, plus some factor that's logarithmic in the dimension; that's what will matter. The point is that this quantity, which a priori is hard to control because you don't get to see the reward matrices before you commit to your density matrix, you can actually control (one standard form of the bound is written out below). Ok, so that was my extremely brief primer on matrix multiplicative weights. Let's translate this into an algorithm for robust mean estimation. Remember the fundamental lemma: it says you need to find a set of samples S-prime with bounded covariance; that's the whole goal. Here's our algorithm. [00:37:27] We're going to iterate, as before, throwing out samples from one iteration to the next; we'll maintain some set of samples S_t, and it gets smaller over time. At each step of the algorithm we're going to simulate a reward player and a density player, and the reward player is going to play the covariance of the next set of samples. I haven't told you yet how to get the next set of samples from today's set of samples, but whatever, the reward player will play the covariance of tomorrow's set of samples. Then the density player will do what she must do according to matrix multiplicative weights: she will play the matrix exponential of alpha times the sum of the covariances so far. And now we see exactly this set-of-outlier-directions business showing up; this is where it falls out: the density player is tracking a weighted set of outlier directions.
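For reference, one standard way to write the density player's strategy and the regret bound from the primer above; normalizations and constants vary across sources, so this is a hedged statement for suitably bounded rewards R_t:

\[
U_t = \frac{\exp\!\big(\alpha \sum_{s<t} R_s\big)}{\operatorname{Tr}\exp\!\big(\alpha \sum_{s<t} R_s\big)},
\qquad
\max_{U \succeq 0,\,\operatorname{Tr}U=1}\sum_t \langle U, R_t\rangle \;-\; \sum_t \langle U_t, R_t\rangle
\;\le\; \alpha\sum_t \big\langle U_t, R_t^2\big\rangle \;+\; \frac{\log d}{\alpha}.
\]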
[00:38:23] Now the only thing I have to do is tell you how to find tomorrow's samples from today's samples. Well, what the reward player will do is take this density matrix that the density player played, which is tracking a set of outlier directions, and ask: what is the projection of each sample into these directions? That's measured by the quadratic form of the sample at this density matrix. [00:38:45] And then she will throw out samples with a large outlier score to find the new set of samples S_{t+1}. So this game goes back and forth between reward player and density player, we're removing samples, and we're hoping that we make some progress on finding a set of samples with bounded covariance. The intuition is that now we're throwing out samples in a way that takes into account information from a lot of directions. But did we really make progress? The cool thing is that the regret analysis (sorry for the giant thing of math, let's get rid of that) [00:39:17] shows you, in a totally black-box way, that you make progress. So remember that we can control regret, which is the maximum reward you could have accumulated versus the reward you did accumulate, plus some terms that you can bound, but I'm not going to worry about them for now. If we substitute in, and this is from the abstract setting, what we really did in our instantiation of this game: we said the reward matrices were the covariance matrices of the samples. So let's put the R_t's to be the covariances. Then on this side, the sum of the R_t's is the sum of the empirical covariances over each step, and the max over density matrices is just the spectral norm of that sum. [00:40:05] And what we're getting is that the spectral norm of the sum is bounded by the reward that was accumulated. How much reward was accumulated? It's the inner product of the density matrix with tomorrow's covariance matrix. Ok, so my goal is to show that one of these covariance matrices has small norm; if I can show the sum has small norm, that's good enough. [00:40:27] So why should this be small? What's the intuition? How did the reward player get tomorrow's samples from today's? She said: let me look at the directions specified by U_t, today's density matrix, and let me throw out samples that look like outliers according to U_t. Those are exactly the samples that have a large inner product with U_t, and they were thrown out to get the next set of samples. So that should mean my new set of samples has a small inner product with U_t, and that's exactly what we prove. The key lemma, which once you know what you need to prove is not hard to prove, [00:41:00] is that if you throw out samples in the right way, you can cause this inner product between today's density matrix and tomorrow's covariance to be bounded by some constant less than one times the covariance you started with. Yeah, I mean, that's one way to interpret it; we're not doing that directly, but [00:41:28] in some sense there is a good strategy. Yes, that's an interpretation of it, but that's not the avenue the proof takes directly; the avenue the proof takes directly is to control what happens in the iterations, to control the amount of reward you can accumulate. [00:41:47] But yes, that's right. Yeah.
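Here is a minimal illustrative sketch of the back-and-forth just described, in the same spirit as the spectral filter sketch earlier. The round count, the fraction of points removed per round, and the dense matrix exponential are simplifications I'm assuming for readability; the actual algorithm uses a careful randomized removal rule and sketching so that the d-by-d exponential is never formed explicitly.

    import numpy as np
    from scipy.linalg import expm

    def que_filter_phase(X, alpha, n_rounds=10, remove_frac=0.05):
        # One phase of the matrix-multiplicative-weights ("quantum entropy") filter,
        # sketched for illustration.  alpha should be small, e.g. on the order of one
        # over the spectral norm of the starting empirical covariance, as in the talk.
        X = np.asarray(X, dtype=float)
        covs = []                                    # reward matrices R_t: empirical covariances
        for _ in range(n_rounds):
            mu = X.mean(axis=0)
            centered = X - mu
            covs.append(centered.T @ centered / len(X))
            # density player: U_t proportional to exp(alpha * sum of covariances so far)
            U = expm(alpha * sum(covs))
            U /= np.trace(U)
            # reward player: score each sample by its quadratic form at U_t ...
            scores = np.einsum('ij,jk,ik->i', centered, U, centered)
            # ... and throw out the samples with the largest scores to form S_{t+1}
            keep = scores <= np.quantile(scores, 1 - remove_frac)
            X = X[keep]
        return X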
Yeah, so how do you choose alpha: you need to choose alpha to balance these two terms, basically, and in the end we use, I think, one over the norm of the starting empirical covariance, because the scale of this whole thing should basically be the starting empirical covariance, and you want this whole side to end up being like one half the starting empirical covariance at the end of the day. [00:42:34] And then you iterate this whole thing for some number of inner iterations; this gives you an algorithm to reduce the spectral norm by half, and now you run that outer loop like log-many times, and you'll get the spectral norm down to order one. Yeah, that's one perspective. I don't know that anybody actually wrote this down, but you can easily use the ideas in these 2016 papers to show that some basic SDP solves robust mean estimation in this setting. So in some sense, [00:43:18] if those SDPs had been in the literature already, which they should have been, you could view this as just speeding up the naive solver for those SDPs. It's a bit non-trivial in this case; I've swept things under the rug. Remember there was this issue that it's not really a convex problem, there was the centering thing, and you have to worry about that. [00:43:37] But at a high level, that's what's going on. Ok, so as I said, I'm ignoring that you have to deal with these terms; at the end of the day you need to understand how these spectral norms are behaving, but it's not a terribly big deal. The point is that this algorithm reduces the spectral norm of the empirical covariance of my samples by some factor, say three quarters, and if I run this log-many times, and maybe I do some naive thing at the beginning to make sure my starting spectral norm was at most some polynomial in d, [00:44:12] then I'll be in good shape: I'll find a set of samples with small empirical covariance. And I didn't tell you how many iterations you have to run for, but basically, because the regret dependence is logarithmic in d (if you've done multiplicative weights before, you've seen this fall out, because that dependence is logarithmic), you'll end up running for only logarithmically many rounds. Of course, to turn this into a fast algorithm you also have to make sure you can implement each iteration fast: you have to be able to compute with these covariance matrices. Well, these are d-by-d matrices; you're not even allowed to write down a d-by-d matrix, so you have to do some kind of sketching. This is all pretty standard stuff, so we use standard multiplicative weights tools to speed up these iterations to nearly linear time, although who knows, maybe we're cheating on the spectral computations. [00:44:58] Ok, so in my last few minutes, just a couple of minutes: the whole point here, zooming back out, was that we wanted to leverage insights from these polynomial-time but not quite practical algorithms to get something we could run. And so, you know, we were very proud of ourselves: we didn't appeal to black-box SDP solvers, we did something that just seems to require some spectral computations, so we can implement it and see what it does.
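Spelled out, the outer-loop accounting described above goes roughly like this (my paraphrase of the argument, constants hedged):

\[
\|\Sigma_{t+1}\|_{\mathrm{op}} \le \tfrac{3}{4}\,\|\Sigma_t\|_{\mathrm{op}}
\quad\text{and}\quad
\|\Sigma_0\|_{\mathrm{op}} \le \mathrm{poly}(d)
\quad\Longrightarrow\quad
\|\Sigma_T\|_{\mathrm{op}} = O(1) \;\text{ after } T = O(\log d) \text{ phases},
\]

at which point the fundamental lemma certifies that the empirical mean of the surviving samples is O(root epsilon)-close to the truth.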
[00:45:25] So in our experiments we studied not quite robust mean estimation, which is a little bit too stylized a problem; we studied what we think is the less stylized version, which is just: remove the outliers from a high-dimensional data set. And instead of even talking about removing the outliers, I just want to talk about taking a dataset and assigning scores to each of the samples, with the goal being that samples with high scores are outliers. So think of it like this: you're a data analyst, you get some messy, uncleaned data set, and you're hoping you have some black-box tool that plots you a histogram, and you can just chop off all the points that wound up far on the right side of the histogram, and then you'll get some possibly clean data set. That's the very optimistic goal. [00:46:10] So we're going to run a bunch of different outlier detection algorithms that all do this: they all assign a non-negative score to each sample. And our metric for the quality of a scoring rule, which is actually standard in the literature on this stuff, is going to be the probability that a randomly chosen outlier in the data set gets a higher score than a randomly chosen inlier. So high numbers in this metric are good: I want all the outliers to have higher scores than all the inliers. [00:46:37] Of course, to evaluate this metric you have to know which are the outliers and which are the inliers, which makes things a little bit of a mess, but I will say how we did this. We did this on a few data sets: we have synthetic data (of course you can make your algorithm work really well on synthetic data), and we also have some datasets coming from word embeddings and from CIFAR-10 images, and I'm going to focus on the images. [00:47:00] For those of you who don't know CIFAR-10, it's one of the standard image recognition datasets; we're not doing recognition, we're just doing outliers, but we wanted to use some standardized set of images that is easily replicable. Ok, so this is a refresher on the outlier scoring technique that came from our algorithm; this is just what falls out of multiplicative weights, and we're calling it quantum entropy scores, which is ridiculous. [00:47:25] The idea is: you get your data set, you compute this matrix exponential (you don't actually compute it, you work with it implicitly in some way), and then the score of a sample is its quadratic form at this matrix, which is an exponentially weighted sum of its projections onto the eigenvectors of the empirical covariance. And if you do some sketching tricks you can compute all of the scores approximately in nearly linear time, because you can compute all these things fast. [00:47:56] Before I show you some plots and we'll be done, I just want to make the disclaimer that, of course, we understand that no single outlier detection algorithm is suited to all datasets; there are many different kinds of outliers. Ours is of course well suited to the situations that we were inspired by, which are hard cases for robust mean estimation. I'm going to show you that there are at least plausible-looking data sets where this sort of looks like a hard case for robust mean estimation, but I don't want to claim this is some panacea.
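Here is an illustrative sketch of the scoring rule and the evaluation metric just described, using a dense matrix exponential for clarity (the nearly-linear-time version computes the scores only approximately, via sketching, without forming the d-by-d matrix). The function names are my own.

    import numpy as np
    from scipy.linalg import expm

    def que_scores(X, alpha):
        # Quantum-entropy-style scores, sketched naively: score each (centered)
        # sample by its quadratic form at the normalized exponential of alpha * Sigma.
        centered = X - X.mean(axis=0)
        Sigma = centered.T @ centered / len(X)
        M = expm(alpha * Sigma)
        M /= np.trace(M)
        return np.einsum('ij,jk,ik->i', centered, M, centered)

    def outlier_auc(scores, is_outlier):
        # The metric from the talk: probability that a random outlier outscores a
        # random inlier (an AUC-style quantity; ties count as misses here).
        is_outlier = np.asarray(is_outlier, dtype=bool)
        out, inl = scores[is_outlier], scores[~is_outlier]
        return float((out[:, None] > inl[None, :]).mean())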
Ok, so what did we do? [00:48:27] Here's how we created outliers; you've got to know what the outliers are. By the way, I think in any experiments run so far on large-scale, high-dimensional data, people basically introduced outliers in some pretty contrived way, like literally just sticking a cluster of Gaussian points somewhere in space. So I think we're doing better than that, even though we're not doing that well at creating realistic-looking outliers. Here's what we did. Take CIFAR-10 images; to create a cluster of outliers, we pick a small set of those images, we pick a random pixel, we pick a random color, and we set that pixel in all of those images to that color. Ok, so it's like you had a dead pixel in your camera on the day those pictures were taken; here we set this pixel to red. [00:49:11] And then, just to have all the cards on the table, we also cheated in the following way. It turns out that CIFAR-10 images are not isotropic; their covariance is really far from identity. But sort of the whole point of our algorithms was that they work well when the good samples have identity covariance, or close to identity covariance. So what we did was assume you have access to some distribution of samples that is kind of like your good samples; in the case of CIFAR, the distribution you have access to is maybe the dogs, and you're trying to do outlier detection on the airplanes. What we use this for is just to find some approximation to the covariance of the good samples, so that we can affinely transform space to get the good samples to look a little bit closer to isotropic; otherwise we're hosed (a sketch of this preprocessing appears below). So we do preprocess the dataset a little bit, but now we're being honest: this is literally what we did. [00:50:04] And the first thing you might want to check is: how do you compare to PCA, to these naive spectral scores? So here I'm plotting the scoring performance, normalized by the performance of PCA, versus the magnitude of this regularization parameter alpha. With no regularization (sorry, this plot's scale is reversed), with no regularization you have PCA, with infinite regularization you just have scoring by ℓ2 norm, and you can see that we do better with some non-zero amount of regularization. So regularization helps, and, as you would expect, the more distinct directions the outliers are in, the more regularization helps you. We see this if you create more distinct clusters of outliers, like choosing 10 different subsets of images and then, for each one, a pixel to corrupt: you see more benefit from regularization, more benefit from these quantum entropy scores. [00:50:57] And finally, the other thing we did was compare against a bunch of baselines from scikit-learn. This is again the number of distinct clusters of bad images versus scoring performance, and we're on top. Of course, as I said, you can easily find datasets where this is totally reversed and the existing methods do better, but the point is to complement existing methods: you want to say there's some class of data sets where you do something the existing methods couldn't, and I think that's what we accomplished. Ok.
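A minimal sketch of the whitening preprocessing described above, assuming you have a reference set of roughly clean samples; the ridge term and this exact recipe are my assumptions for illustration, not necessarily the paper's precise preprocessing.

    import numpy as np

    def whiten_with_reference(X, X_ref, ridge=1e-6):
        # Estimate a covariance from a reference set of (presumed clean-ish) samples
        # and apply the corresponding affine transform so that good samples in X
        # look closer to isotropic.
        mu_ref = X_ref.mean(axis=0)
        C = np.cov(X_ref, rowvar=False) + ridge * np.eye(X_ref.shape[1])
        # inverse square root of the reference covariance via eigendecomposition
        w, V = np.linalg.eigh(C)
        W = V @ np.diag(w ** -0.5) @ V.T
        return (X - mu_ref) @ W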
[00:51:35] So, to wrap up: the first nearly linear time algorithm for robust mean estimation that achieves dimension-independent error rates. The key insight is to use these exponentially weighted covariance matrices, these matrix exponentials, to chase the outliers, and our algorithm is implementable: we can run experiments on this laptop in thousands or tens of thousands of dimensions. Ok, so thank you all for your attention. [00:52:06] Yeah, in the running time: yes, that epsilon is not our epsilon, because we don't need to solve to additive epsilon. All we need to do is reduce the empirical covariance by half, so we don't need this side to actually be epsilon; this side just needs to go down to about half of its starting value, so it's like a constant for our purposes. [00:52:57] Yes, yeah, yes, and you get that. I don't know, it's a good question, I have no idea. You would like to be able to do this, and I guess the SDP does this, this is just the normal one, so you would expect that you could do it. [00:53:37] I'm not exactly sure. Actually, I don't even know: does the naive filter do this, just the regular old spectral filter, without some modification? My guess would be that you could reanalyze this; this algorithm should somehow only be better than the spectral filter. [00:54:03] But I certainly didn't do it, and I don't know off the top of my head how to. I see what you're saying. But you know that you should still have to do that, because the problem in the experiments should be that the outlier directions are actually dwarfed by the high-variance directions of CIFAR-10, and then you're probably screwed. Does that make sense? Ok, yeah. [00:54:44] Yeah, it's a good question. I mean, somewhere in the middle. So, no, I guess I don't think we have the right set of real questions to use this stuff for yet. I think there are serious outlier detection problems that it would be good to solve in practice, and actually Jerry especially has been talking to people in biology. A real problem is that these high-throughput gene sequencing machines [00:55:20] have some fraction of garbage reads: they take, say, a fly genome and chop it up into 100-base-pair sequences, and then they can sequence these extremely fast and you stitch them back together. The problem is that one percent of your 100-base-pair reads are just crap, and my understanding is that, at the moment, if you want to sequence a totally new genome and you don't have a reference genome, the approach to dealing with the garbage reads is, like, "by graduate student": they sit there and they get rid of the bad reads. [00:55:54] So this actually would be a serious problem. The issue is that it's really not clear that it's captured in any of this setup, so we were sort of thinking for a little while about that, but I don't know how to translate the setting. The problem is that that's really about sequences, [00:56:13] and this is really about geometry, so unless you were able to translate the sequence data into some high-dimensional geometric problem, it's not clear. Makes sense? If you have a problem we can solve, you know, I'd love to solve it. Thanks.