Thanks for inviting me. I see a small audience, so please interrupt me at any point to ask questions. This is the first time I'm giving this talk. The work has evolved over a number of years, but we wrote it up less than a year ago; it's a very long paper. It is joint work with Daniel Kane, who is a colleague at San Diego, and Alistair Stewart, who is my postdoc at USC, previously at Edinburgh. There are two terms in the title: one of them is "unsupervised learning" and the other is "statistical query lower bounds," and I'm going to explain both in the introduction.

This is the structure of the talk: after a brief introduction I'll state the results — that alone is probably going to be half of the talk — then give some idea of how the lower bounds work, some idea of how the proof works, and then some extensions. I'm only going to present a subset of the results before concluding, because that's all the time permits.

So, what is unsupervised learning? The motivation is to have a theory for it: the theory of supervised learning — learning Boolean functions — has been very successful in computer science theory, but it doesn't cover the setting where the data is unlabeled. So suppose you see some picture — these are points in the plane — and you have no idea where they came from. The goal is to discover some hidden structure in this unlabeled data, if it exists. For example, who can guess what these points are? ... OK, this is a mixture of Gaussians in the plane. Drawn like this you can see it, but in high dimensions maybe you wouldn't.

Usually the idea is that we make some assumption about the model that generates the data: there is some probabilistic model generating samples, and we assume the data is generated from this model. What we want to do is fit the data to the model — find the parameters of the model. So here is unsupervised learning, stated very generically: the input is a sample generated by a model in a given family, for some unknown parameters — for example, a mixture of Gaussians has the mixing weights, the means, and the covariances — and the goal is to find parameters that are close to those of the true model. I'll define this more formally for the specific questions we look at. Now, there are two questions. The first: is there an efficient algorithm — a computationally efficient algorithm — to solve this generic learning problem? And then, at a more refined level:
There are three criteria we care about — at least three, but these are the three I care about for this talk. The first is the sample size: how many samples do you need to draw, as a function of the error, so that you have enough information to actually reconstruct the model approximately? The second is the running time: can we design an algorithm that runs in time polynomial in the size of the sample? The third is robustness: what happens if your assumption about the model is not exactly correct, so that the samples do not come from the exact model you assumed but from some neighboring model? And the broader question — a generalization of question one — is whether there are inherent tradeoffs between these criteria: can we achieve the best possible value in all three of them, or, if we want one of them to be small, do we have to increase the others — is there a tradeoff curve? This is a very generic type of question, and we're going to instantiate it with specific problems. So this is the unsupervised learning part; is it clear to everyone?

OK, now let's talk about statistical query learning. What is it? It's a restricted model of computation — a computational model for learning algorithms — and even though it is restricted, it is quite powerful. Usually a learning algorithm takes as input a set of samples from a distribution, does computation on those samples, and solves the problem. In the statistical query (SQ) model, defined by Kearns in the nineties in the context of learning Boolean functions, we do not have direct access to the samples. What we have instead is an oracle — the statistical query oracle — that does the following: we choose a bounded query function f on the domain of our distribution, say with range [0, 1], and the oracle gives us an estimate of the expected value of f on the distribution. So this is the true expectation of f(X), where X is drawn from D, and the oracle returns an estimate of it up to some additive tolerance. The algorithm adaptively chooses queries like that, interacts with the oracle, and eventually outputs an answer to the problem.

Now, there are two parameters of interest. One is the accuracy — the tolerance τ — of every statistical query. This in some sense represents how many samples you need: recall that if you have a bounded random variable and you want to estimate its expectation up to an additive τ, in general you need on the order of 1/τ² samples. So this quantity 1/τ² is something we have to pay when we talk about the complexity of these algorithms. The second parameter is the number of queries, say q. These two — q and the inverse square of the accuracy — are how we measure the complexity of these algorithms. It is not obvious how an SQ algorithm translates into a sampling algorithm with the best possible sample complexity, because the queries are adaptive; people have worked this out, and it is not easy. But it is clear that if we need every query answered with accuracy at most τ, we have to pay at least on the order of 1/τ² samples even for a single query; and q, the number of queries, roughly represents the running time of the algorithm. So this is the definition of the model.
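To make the model concrete, here is a minimal sketch of how one could simulate an SQ oracle from samples — the function names are mine, not from the talk. The key point is that any answer within the tolerance τ is a legal response, which is exactly what makes lower bounds against this model meaningful:

```python
import numpy as np

def sq_oracle(samples, query, tau, rng):
    """Simulate one statistical query: return E_D[query(X)] up to an
    additive tolerance tau. 'query' maps each sample into [0, 1]."""
    est = np.mean([query(x) for x in samples])
    # The oracle may return *any* value within tau of the truth;
    # we model that adversarial slack with uniform noise.
    return est + rng.uniform(-tau, tau)

# By a Chernoff bound, roughly 1/tau^2 samples make the empirical mean
# itself tau-accurate -- which is why 1/tau^2 is the "price" of a
# single query of tolerance tau.
rng = np.random.default_rng(0)
samples = rng.standard_normal(10_000)
ans = sq_oracle(samples, lambda x: float(x > 0), tau=0.01, rng=rng)
```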
OK. So this is a somewhat unusual model, and it is restricted — which gives us hope of proving computational lower bounds — but it is also very powerful. This is not a theorem, it's an empirical observation: a wide range of algorithms and algorithmic techniques, in both supervised and unsupervised learning, can be implemented in this model with only small overhead. Essentially every algorithm we know can be implemented in this model. This includes the standard algorithms and techniques for learning Boolean functions, and the unsupervised learning ones as well: in particular, we can implement first-order and second-order methods for convex optimization, expectation maximization, moment methods, tensor methods, local search, and so on. So it is a powerful model. We don't fully understand its power, but at least, since it is restricted, we are able to prove unconditional lower bounds for algorithms in this model. I should say there is a single algorithm — the only known one — that cannot be implemented in this model. Would you guess which one it is? ... OK: Gaussian elimination over finite fields. That one provably cannot be implemented in this model, and, for example, it is what you would need to learn parities. So you cannot learn parities in this model, and essentially this is the only known exception.

All right, so this is the SQ model. Now, since it is a restricted model, as I said, there is a known methodology for proving lower bounds. It is not easy to explain in a few minutes, but there is a notion of statistical query dimension, which was defined in the nineties for Boolean functions; in a more recent paper — essentially a Georgia Tech paper — it was generalized by Feldman et al. to general problems over distributions, and this covers the unsupervised estimation problems I'm going to talk about today.

Roughly, the idea relies on a notion of correlation between probability distributions — I'll explain it only loosely at this point — and the recipe to prove a lower bound is first to construct many distributions in your family that are almost uncorrelated. Roughly speaking, the number of almost-uncorrelated distributions corresponds to the number of queries that is needed, and the degree of correlation of these almost-uncorrelated distributions corresponds to the accuracy of the statistical queries. This part is well understood, so whenever you are presented with such a problem, it suffices to construct an instance family with a large number of almost-uncorrelated distributions.

So far so good. All right, that was the introduction; now let's go to the results. The first part — this is really the main content of the talk — is a general technique to prove statistical query lower bounds for a range of high-dimensional learning tasks, or more generally estimation tasks. In this talk these all involve some kind of hidden high-dimensional Gaussian distribution; that is not necessary for the technique, but it is what I'll use here. Concretely, we get lower bounds for all of these problems: learning mixtures of Gaussians, robustly learning a single Gaussian, robustly testing a Gaussian, and some kind of statistical-computational tradeoff that I will only mention towards the end. Let me explain each one of them.
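As an aside before the results — since this notion will come back in the proof at the end of the talk — here is the pairwise correlation from the Feldman et al. framework, in my own rendering (relative to a reference distribution D):

```latex
\[
  \chi_{D}(P_1, P_2) \;=\; \int \frac{P_1(x)\,P_2(x)}{D(x)}\,dx \;-\; 1 ,
\]
% so that \chi_D(P, P) is the chi-squared divergence \chi^2(P, D).
% The recipe: exhibit distributions P_1, ..., P_N with N huge and
% |\chi_D(P_i, P_j)| tiny for all i != j; N drives the query count,
% and the correlation level drives the required query accuracy.
```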
OK, let me go on. Learning mixtures of Gaussians is one of the most well-studied problems at the intersection of learning theory, CS, and statistics, and because it can be confusing if you are not part of the area, I'm going to give you a six-sentence summary of what is known so far. First, for people who are not in the field: a mixture of Gaussians is a convex combination of Gaussian distributions. There are some weights w_i, you have k components, and the weights themselves form a probability distribution. The intuition is that with probability w_1 you take a sample from the first component, with probability w_2 from the second component, and so forth. So a mixture of k Gaussians is a mixture of k components where every component is a Gaussian. Now, what are the parameters you need to learn such a distribution? You need the mixing weights — these are k numbers that sum to one — the mean vectors of the Gaussians, and the covariance matrices. This is a d-dimensional problem, and this unsupervised model has been studied since the beginning of statistics; in CS, people have been designing learning algorithms for it since the nineties.

As I said, it can be confusing, so let me try to clarify. There are two related learning problems. The most well-studied one — basically what the statisticians were doing in the beginning, and what the CS people did in the past decade — is called parameter estimation. Here the goal is to recover the true model parameters: you literally want to find approximations to the w_i's, the μ_i's, and the Σ_i's. There are two regimes in which this problem is studied. The first is when the individual components are separated. I will not define precisely what "separated" means, because there are many different definitions, but roughly speaking it is some condition ensuring that the components are very far from each other — in particular, that the overlap is very small, or equivalently that the total variation distance between components is close to one. [Here is a picture.] There has been a multitude of algorithms for this regime; the strongest one, as far as I know, is a result by Brubaker and Vempala that is able to essentially resolve this case, at least for two components — not quite in general, but it is literally the strongest thing we know. So this is family of algorithms number one, and these algorithms are usually based on spectral methods.
A basic version of the spectral method suffices: because you have separation between the components, you can do some kind of clustering. Now, the sample complexity of the problem — of the problem, not of the known algorithms — in this separated case is polynomial in the dimension and the number of components, and the algorithms known under these assumptions run in polynomial time. The more general version is when we have no separation assumption — or, more formally, some tiny separation: we do need at least some small ε of separation between every pair of components, otherwise you cannot distinguish them. In this case you need to use the method of moments. There has been a line of work on this, initiated by Kalai, Moitra, and Valiant, and the situation is more complicated: even in one dimension, when there is very large overlap between the components, the sample complexity of the problem can grow exponentially with the number of components, and therefore so can the running time. Note that the sample complexity is polynomial in the dimension, times something exponential that depends on the separation and the number of components, but the running time of the known algorithms pays a d^k-type factor on top of that — which, as we will see, is not obviously necessary even in this case.

All right, that is problem number one. Any questions? — [Question about γ.] γ is a parameter that quantifies the overlap between the components: roughly speaking, up to polynomial factors in the dimension, it is the minimum total variation distance between pairs of components. Suppose you have two components at total variation distance γ: you need on the order of 1/γ samples just to see samples from the region where they differ; otherwise it is impossible to pin down the parameters, and that is why the problem requires many samples.

So this is, roughly speaking, my understanding of the parameter estimation problem. And there is a second problem which, as I said, should be easier: density estimation. Here your goal is just to approximate the underlying distribution — output any hypothesis, which may as well be a mixture of Gaussians for all we know, that is within statistical (total variation) distance ε of the true underlying distribution. This is an easier problem in a very well-defined sense: in particular, the sample complexity of this problem is polynomial in the dimension and the number of components without any assumptions — you do not need separation between the components for this problem to be solvable with few samples. On the other hand, the best known algorithms even for this easier problem are essentially the moment-matching algorithms from the previous slide, and again the running time of those algorithms grows exponentially in k — and the exponential is not 2^k, it's d^k. So the question is whether this is necessary, and that is going to be the first result. Are there questions? I saw a hand. — Yes, yes: k is known here, so we want to fit a mixture of k Gaussians to the data. If k is not known it is much harder; even this much easier version of the problem is then something we don't know how to do properly. — [Another question.] I mean, suppose you want to do density estimation for a mixture of one-dimensional Gaussians: then something like 1/ε² samples (up to factors of k) suffice, whereas for parameter estimation you would need many more.
If you allow improper output — if you give up on outputting a mixture — you can fit some kind of piecewise polynomial to the data, and you get the optimal sample complexity in polynomial time. Yeah — so even for Gaussians, even compared to a proper learning algorithm, you can beat what parameter estimation does; a proper learner would output some mixture, but not necessarily one whose parameters are close to the true ones. All right.

An important fact is that when the components are separated, density estimation and parameter estimation are equivalent in terms of sample complexity. Therefore polynomially many samples — poly(d, k) — suffice for both of these tasks. So the question now becomes the following. We know that density estimation can be solved with poly(d, k) samples, and parameter estimation can be solved with the same number of samples if the components are separated, which is a reasonable assumption. Is there an algorithm for either of these two learning problems that actually runs in time polynomial in its sample complexity? That is question number one, and the answer is no, as long as the algorithm is a statistical query algorithm. So this is the first theorem; let me read it for you. It basically says that any statistical query algorithm that solves the learning problem for separated mixtures of k Gaussians in d dimensions must have one of the following characteristics: either some statistical query needs very high accuracy — something like d to the minus k — or the number of queries needs to be exponential in the dimension.

Here "separated" means that the total variation distance between every two components is very, very close to one — we can make one minus the total variation distance be 1/poly(d, k), or even exponentially small — so there is essentially no overlap between the components.

What is the conceptual message? Even though the sample complexity of the problem for such instances is polynomial, the computational complexity of any such algorithm must grow exponentially in the dimension of the latent space — the latent space being the k unknown hidden variables, the k weights. So we need to pay exponentially in k: some brute force is necessary. Yes? — [Question about the accuracy.] The accuracy is basically a proxy for sample complexity: the statistical query oracle allows you to estimate expectations of bounded functions over the distribution, and each such expectation is estimated up to an additive τ; that τ is your accuracy, so it is like a measure of the number of samples you need.

All right, so why does the theorem give a running-time lower bound? You can parse it as follows. There are two cases: either some query has this tiny accuracy — but then, just to simulate that single query with samples, you would need d^{Ω(k)} samples, and therefore d^{Ω(k)} time just to read the samples — or the number of queries must be very large. In both cases you get a lower bound on the running time of the SQ algorithm. — Yes, yes, and it's even worse: there is also a tradeoff between the two resources, so improving one of them forces the other to get much worse. OK.
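In symbols — my rendering, with the unspecified constants hidden in the Ω's — the theorem we have been discussing says that any SQ algorithm for this learning problem must use a query tolerance τ or a number of queries q with:

```latex
\[
  \tau \;\le\; d^{-\Omega(k)}
  \qquad\text{or}\qquad
  q \;\ge\; 2^{\Omega(d)} .
\]
% Simulating a single query of tolerance \tau from samples already
% costs \Theta(1/\tau^2) samples, so the first branch means
% d^{\Omega(k)} samples (and time) for even one query.
```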
I'm going to move on to problem number two now. So far we've talked about learning a model under the assumption that the data actually comes from the model — the samples are clean samples from a mixture of k Gaussians — and even under this assumption you cannot get fully polynomial time, at least in the number of components. Now I'm going to talk about a seemingly innocuous problem that turns out not to be so simple. The high-level motivation is that we want to design robust algorithms: algorithms that succeed even in the presence of a reasonable fraction of corruptions in the data. The model of corruption I'm going to use is what you would naturally expect from agnostic learning: there is some family F of distributions — say all high-dimensional Gaussians, or just a single Gaussian — and we get samples from a distribution D' which is ε-close to some unknown distribution in the family. So instead of getting samples from a single Gaussian, you get samples from something ε-close to a Gaussian, and you want to learn it.

The most basic version of this problem is the case where the family F consists of unknown-mean, identity-covariance Gaussians. So the question becomes: you get samples from an ε-corrupted version of N(μ, I) — the only unknown parameter is the mean vector — and you want to find a hypothesis that is within total variation distance O(ε) of the truth. Actually, trying to get within total variation ε of N(μ, I) is essentially the same as getting within total variation ε of D'; I just prefer to approximate the true distribution before the corruption — I like stating it this way, but it doesn't really matter. I should also say that O(ε) is the reasonable target accuracy here: information-theoretically, this is the best we can achieve even if we spend exponential time, and there are (exponential-time) estimators that achieve O(ε). So ε plays a double role: the fraction of corruptions — i.e., the total variation distance between the model we observe and the true family of models — and the target accuracy.

Now, this sounds like a trivial problem, right? You would do some kind of outlier removal — remove the obvious outliers and output the empirical mean of the rest — and this should work. Well, it's not exactly the case. People thought about this in the sixties, in robust statistics, and the best that was known in terms of computationally efficient estimators were algorithms with error ε times √d — the ε you could hope for, blown up by a factor of the square root of the dimension. For example, naive approaches like the coordinate-wise median or the geometric median — the obvious things you would try — give you exactly that. Very recently, two independent groups — one of them Lai, Rao, and Vempala, the other a group of many people including myself — obtained an answer that is much better than the naive ε√d: roughly speaking, these algorithms achieve total variation error ε, the best you can achieve information-theoretically, times only the square root of the logarithm of 1/ε — completely independent of the dimension, and almost matching the optimum. Given that this result exists, one could imagine that it is a matter of time until this becomes O(ε), right? There has to be a better algorithm.
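Before the lower bound, here is a toy sketch in the spirit of the filtering approach behind these guarantees — my own simplification, with ad hoc thresholds (the constant 5, the 2% trimming), not the algorithm from either paper. The idea: if the corrupted points shift the mean, they must also inflate the empirical covariance in some direction, so look at the top eigenvector and discard the extreme points along it:

```python
import numpy as np

def filtered_mean(X, eps, max_rounds=30):
    """Toy robust mean estimator for eps-corrupted N(mu, I) samples
    (n x d array X). A sketch only: thresholds here are illustrative."""
    X = np.array(X, dtype=float)
    for _ in range(max_rounds):
        mu = X.mean(axis=0)
        cov = np.cov(X, rowvar=False)
        eigvals, eigvecs = np.linalg.eigh(cov)
        if eigvals[-1] <= 1 + 5 * eps * np.log(1 / eps):
            break                        # covariance looks Gaussian: stop
        v = eigvecs[:, -1]               # direction of largest variance
        scores = np.abs((X - mu) @ v)    # outlier scores along v
        X = X[scores < np.quantile(scores, 0.98)]  # drop the extreme 2%
    return X.mean(axis=0)
```

The design point the talk is making: once no direction has variance noticeably above 1, the remaining outliers can only move the empirical mean by about ε√log(1/ε), which is where that error guarantee comes from.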
So this is the summary: there exists a polynomial-time algorithm that robustly learns μ to within O(ε√log(1/ε)). Is there an algorithm that is better than that — anything better than ε√log(1/ε), ideally O(ε)? And the answer to that is no: again, no statistical query algorithm can do it. That is the theorem. It says that any asymptotic improvement over ε√log(1/ε) requires a running time that scales exponentially with the multiplicative improvement. The improvement factor would have to depend somehow on ε in order to beat ε√log(1/ε), and this gets reflected in the running time: for example, if you wanted the error to be O(ε), the running time of SQ algorithms would need to grow super-polynomially — something like d^{poly log(1/ε)}.

So this is what accounts for the running time of the algorithm. In the previous positive result, you can actually get an error guarantee that does not depend on the dimension, with an algorithm that runs in polynomial time; now the question was whether that is a good result, and the answer here is that if you want the optimal robustness that information theory allows, you cannot get it: within this class of algorithms, the exponent of the polynomial must depend on ε — it has to grow. So there is basically some kind of gap between what can be achieved information-theoretically and what can be achieved computationally by this class of algorithms. Is the theorem clear? The general takeaway is: optimal robustness, at least in this model of corruptions, cannot be achieved efficiently — and the previous algorithm, the known one, is in fact optimal among all algorithms with running time poly(d/ε). This is a bit less clean than the first theorem, so if you have any questions please let me know. Is it clear to everyone what the statement is?

OK, so let me move on. Why is this the case — why can't you get below Ω(ε) at all? Well, you observe a corrupted Gaussian: on a (1 − ε)-fraction of the domain it is Gaussian, and the remaining ε-fraction is completely arbitrary, so you really have no way of predicting what that ε-fraction could be. If the only thing you know is that you observe an ε-corrupted Gaussian, then no matter how many samples you take and no matter how much time you spend, you just cannot get error better than Ω(ε). That is an information-theoretic argument, similar to what people were doing in agnostic learning in the nineties: one can construct noisy versions of two different Gaussians, with means at distance Ω(ε) from each other, that are identical as distributions — and therefore it is impossible to distinguish between them no matter how many samples you take. All right.
OK, so let me move on to the third result, which is not about SQ algorithms — it is about sample complexity, so it is information-theoretic. The reason I mention it is that it follows from the same technique: the same lower bound construction that I will present later gives all these SQ lower bounds, and it also gives sample complexity lower bounds for various problems that were open. And the reason I single out this particular result is that it has a specific conceptual meaning. So now I'm going to leave learning for two minutes and talk about hypothesis testing. This is again an unsupervised problem: you have some null hypothesis, some alternative hypothesis, and you want to distinguish between them; the question is how many samples you need. Here is a testing problem in d dimensions: distinguish a zero-mean Gaussian from a large-mean Gaussian. The null hypothesis is that your distribution is N(0, I); the alternative is that it is N(μ, I) where μ has large ℓ₂ norm — I do need to assume the mean is large in the alternative case, otherwise the two cases are not distinguishable from each other. Now, one way to solve this problem is to just learn the mean, which would need something like d/ε² samples — just learn the Gaussian, and this would work. But you can do better: there is a very basic algorithm for this problem that uses about √d/ε² samples. So you can do testing with fewer samples than you need to actually learn the underlying distribution — quadratically fewer in the dimension.

Is that clear? Now, what happens if we add robustness to the problem? In particular, suppose that in the soundness case your distribution is not exactly N(μ, I) but a δ-corrupted — noisy — version of N(μ, I), where δ is much smaller than ε; let's make δ be ε over a thousand, so that there is still enough gap between the two cases and you can still distinguish them information-theoretically. How many samples do you need now? The answer is that the sample complexity jumps from √d to d — it jumps quadratically. So robustness actually, in this case, makes the problem harder information-theoretically.

OK, I think that is it in terms of the presentation of the three main results. I'm going to jump to the techniques now, so if there are questions, this is a good point to ask them.

OK, so how do we prove all these things? There is a generic technique: we construct an explicit family of instances that are hard for SQ algorithms, and the same instances give the corresponding testing lower bounds. How does it work? Here are the steps. Essentially, we construct distributions that are standard Gaussians in all directions except one: there is a special, hidden direction in which the projection of your distribution is some one-dimensional distribution with nice properties, while on the orthogonal complement you have a standard Gaussian. Let me be more precise. We start with a distribution P_v, parameterized by a unit vector v representing a direction, such that the projection of P_v onto the direction v is some one-dimensional distribution — we are going to call it A for the rest of this talk — that agrees with the standard one-dimensional Gaussian in the first m moments.
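Here is a minimal sketch of how one would draw from such a hidden-direction distribution — my own illustration, where sample_A is a hypothetical placeholder for a sampler of the one-dimensional distribution A. As a sanity check, if A were exactly N(0,1) the output would be exactly a standard Gaussian in R^d, which is the whole point of the construction:

```python
import numpy as np

def sample_pv(v, sample_A, n, rng):
    """Draw n points from P_v: distributed as A along the hidden
    direction v, and as a standard Gaussian on v's orthogonal
    complement. (An illustration of the construction, not paper code.)"""
    v = np.asarray(v, dtype=float)
    v = v / np.linalg.norm(v)
    t = sample_A(n, rng)                   # 1-D coordinates along v, ~ A
    z = rng.standard_normal((n, len(v)))   # standard Gaussian in R^d
    z -= np.outer(z @ v, v)                # kill the component along v
    return np.outer(t, v) + z

# e.g. sample_A = lambda n, rng: rng.standard_normal(n) recovers N(0, I).
```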
The number of matched moments, m, is also a parameter of this construction. The hard family of instances is then the set of all P_v, for all possible unit vectors v, and the question is: I give you an unknown distribution from this set; find a hypothesis which is close to it.

— For all possible directions? — Yes, you take literally all possible vectors on the unit sphere, so this is an infinite family of instances; it suffices, in a certain sense, to take a net, and I'll explain that when I get to the proof. Is the high-level goal clear? Yes — the difficulty is to find the hidden direction: to find the hidden v, or at least to learn a v which is very close to the true direction. That is what is hard for SQ algorithms. The number of queries that you need to solve this problem under these assumptions is going to be stated precisely in the proposition, which comes two slides from now.

All right, so what is the idea? There is this hidden-direction distribution, as I just said. So what is this formula? P_v(x) is the probability density function of the underlying distribution; x is a vector in d dimensions; the projection in the v direction is the one-dimensional density A, and on the rest — the orthogonal complement — you get a standard Gaussian. In symbols, P_v(x) = A(v·x) times a standard (d−1)-dimensional Gaussian density on the orthogonal complement. By the way, I should stress that this equation describes the distribution in a non-standard coordinate system: in the direction of v it is equal to A, and on the orthogonal complement it is N(0, I_{d−1}).

As an example, suppose A is a mixture of, say, five one-dimensional Gaussians that are far from each other. What do you get by applying this operation — what is P_v going to be? P_v is going to look something like this in two dimensions: here is the direction v, and these are the level sets of your two-dimensional density. Imagine rotating it, and you get the different instances. In two dimensions this seems easy, but the real issue is that in high dimensions these things are difficult to distinguish — and let me explain intuitively why it is actually difficult to solve this problem.

So here is the proposition — this is not the full theorem; there are some additional assumptions that you can ask me about if you want. Suppose that the one-dimensional distribution A — the one-dimensional projection — matches the first m moments of the one-dimensional standard Gaussian, and that you have a "net": whenever you take two unit vectors v and v' that are nearly orthogonal, meaning their inner product is at most, say, one tenth, the total variation distance between the pair P_v and P_{v'} is larger than, say, 2δ. Then any SQ algorithm that learns an unknown P_v to within total variation δ — 2δ is the separation of the net, and you want to achieve half of that — needs either very accurate queries, of accuracy roughly d^{-Ω(m)}, the dimension to the minus the number of matched moments, or exponentially many queries. This is the general proposition, and for each of the problems I mentioned, what we do is instantiate the construction with an A that has the right properties, so that these conditions are satisfied, and we get our lower bounds. All right — questions?
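Collecting the construction and the proposition in one display — my reconstruction of the slides' notation, with the exponents hedged by Ω's:

```latex
% The hidden-direction density: A along v, standard Gaussian on the
% orthogonal complement (\phi_k = k-dimensional standard Gaussian density):
\[
  P_v(x) \;=\; A(v \cdot x)\,\phi_{d-1}\!\bigl(x - (v \cdot x)\,v\bigr),
  \qquad v \in S^{d-1}.
\]
% Generic proposition (as stated in the talk): if A matches the first
% m moments of N(0,1), and d_{TV}(P_v, P_{v'}) \ge 2\delta whenever
% |\langle v, v' \rangle| \le 1/10, then any SQ algorithm that learns
% an unknown P_v to error \delta needs
\[
  \text{query accuracy } \tau \;\le\; d^{-\Omega(m)}
  \qquad\text{or}\qquad
  2^{\Omega(d)} \text{ queries}.
\]
```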
— [Question from the audience.] No, because A only matches the first m moments, and the m I'm going to take is small enough that the total variation distance stays constant. In particular, if A is a mixture of k Gaussians, I'm going to match 2k − 1 moments — if you matched many more, I don't know, say 3k, then A would be forced close to a Gaussian. In the one-dimensional construction A must remain very far from a Gaussian even while matching moments; and I think this is exactly the gap between the information-theoretic picture and what these algorithms can see.

All right, so this is the proposition, and I'm going to give you some idea of how to prove it. To actually prove it you need all this machinery of the statistical query dimension, which, if you haven't seen it, gets a little bit tricky — but at least there is an intuition here for why it is hard to find the hidden direction. The first thing to observe is that you cannot just blindly use moments. If you use moments: since A matches the first m moments of N(0, 1), by the definition of P_v all these distributions are identical to the standard Gaussian N(0, I) in their first m moments. So you literally need to go to the degree-(m+1) moment tensor to see what is happening, and in high dimensions, apart from this tensor having far too many entries, you would also need to estimate those entries accurately. So this is not going to work. Another thing people use is random projections. What happens if you take a random projection? This doesn't work either, and this is going to be a claim that I'll explain: to be able to solve this problem with one-dimensional projections — to distinguish between P_v and N(0, I) that way — you would need to look at exponentially many directions.

OK, so that was the high-level intuition. Let me give you the key lemma of what we actually prove — this is essentially what suffices to prove the SQ lower bounds. Let's look at the distribution of the projection of P_v in the direction of some other vector, v'. I claim that the chi-squared distance between these two distributions — Q, the one-dimensional projection of the high-dimensional P_v in the direction v', and N(0, 1) — goes to zero exponentially in the number of matched moments, as long as the vectors are nearly orthogonal: if their inner product is small, say smaller than one half, this goes to zero very fast, and in fact for us the inner product is going to scale like 1/√d with the dimension. So, is the statement clear? Basically it says the following: take v, which is fixed, and take a random direction v'; the inner product between them is going to be on the order of 1/√d, so the projection is going to be extremely close — d^{-Ω(m)}-close — to a one-dimensional standard Gaussian. Note that the chi-squared distance between A and the standard Gaussian could be huge for all we care — we only need it to be finite — and in any case it does not depend on the dimension, so for all practical purposes we can treat that term as a constant.
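In display form, the key lemma as I transcribe it (the exact constants are hedged):

```latex
% Q_{v'} := law of v' \cdot X for X ~ P_v;  \theta := angle(v, v').
% If A matches the first m moments of N(0,1), then
\[
  \chi^2\!\bigl(Q_{v'},\, N(0,1)\bigr)
  \;\le\; \cos^{2(m+1)}\!\theta \;\cdot\; \chi^2\!\bigl(A,\, N(0,1)\bigr).
\]
% For a uniformly random v', typically |\cos\theta| \approx 1/\sqrt{d},
% so the bound is d^{-\Omega(m)} times a dimension-independent constant.
```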
— [Question.] Right: that assumption is not needed at all for this lemma; it is needed later — the lemma is independent of it. The assumption about the net, about pairs being far apart in total variation, you need because if the distributions you are trying to distinguish were very close to each other, the learning lower bound would be vacuous; for the problem to be hard in a meaningful sense, they have to be far apart.

OK, now I'm going to show some pictures that took me quite a while to construct. How do we prove this thing? So here is the vector v, v' is the other one, and θ is the angle between them. What do we know? In the direction of v — let me call this coordinate x — what is the distribution of v·X? By definition it is A, right? So if X is drawn from P_v, then v·X follows A, and in the orthogonal direction — let's say we are in the plane; otherwise there is a direction coming out of the board, and we are still standard Gaussian there — we see N(0, 1). So on this plane, your P_v is a product distribution of this form. Now I'm going to take the orthogonal axes corresponding to v'; call them x' and y'. What I am interested in finding is the distribution of v'·X, where X is drawn from P_v; let me call this distribution Q. My goal is to show that Q is very close to N(0, 1) in chi-squared distance, where the closeness is governed by the cosine of the angle θ between the two directions. Is the statement clear? Is the picture helpful?

So how do we prove it? Since the distribution P_v is a product distribution on this plane, to calculate Q(x') — the density of a sample projected onto the x' direction — we just integrate over y': Q(x') is the integral over y' of the density of P_v at the point with coordinates (x', y'), because P_v is a product distribution with respect to the x and y axes. Then the only thing we need to use is the relation between the coordinates (x, y) and (x', y'), which is a rotation by θ, and that gives us the formula below. This doesn't seem like much, but it is extremely useful. In particular, what happens if the angle θ is ninety degrees? Go back to the previous picture: when the angle θ is ninety degrees, the projection is exactly N(0, 1), by definition. So the point is that if the angle is almost ninety degrees — the vectors are almost orthogonal — the projection is going to be close to Gaussian, and the lemma is exactly the way to quantify that.

So we have a formula for Q(x'), the distribution we are looking for, and it says that Q is, in a well-defined sense, a noisy version of the distribution A. This is not informal: literally, Q(x') is equal to (U_{cos θ} A)(x'), where U_ρ is the noise operator. It is a well-known operator in a much more general setting; this one-dimensional Gaussian version is sometimes called the Ornstein–Uhlenbeck operator. It takes a function and maps it to another function, and the idea is that it smooths the function out — it makes it closer to a Gaussian in a well-defined sense. So just by using properties of this operator, together with the fact that the projection is the operator applied to A, we actually get the lemma. Do you want to see the proof, or are we out of time? — We have five minutes, and then we end. — OK, I won't give the full proof, but let me show you what it looks like.
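Here is what the calculation looks like, in my reconstruction (probabilists' Hermite polynomials He_i, G the standard Gaussian density; sign conventions in the rotation may differ from the paper's):

```latex
% Step 1: rotate to the axes (x', y') of v', with \theta the angle
% between v and v', and integrate out y':
\[
  Q(x') \;=\; \int_{\mathbb{R}}
      A\bigl(x'\cos\theta - y'\sin\theta\bigr)\,
      G\bigl(x'\sin\theta + y'\cos\theta\bigr)\, dy'
  \;=\; \bigl(U_{\cos\theta} A\bigr)(x'),
\]
% where U_\rho is the Gaussian noise (Ornstein--Uhlenbeck) operator:
% for T ~ A and Z ~ N(0,1) independent, U_\rho A is the law of
% \rho T + \sqrt{1 - \rho^2}\, Z.
%
% Step 2: expand A against the Hermite eigenbasis of U_\rho,
\[
  A(x) \;=\; G(x) \sum_{i \ge 0} c_i\,\frac{He_i(x)}{\sqrt{i!}},
  \qquad c_0 = 1,\quad c_1 = \cdots = c_m = 0,
\]
% the vanishing coefficients being exactly the m-moment-matching
% condition. U_\rho scales the i-th coefficient by \rho^i, hence
\[
  \chi^2\!\bigl(U_\rho A,\, G\bigr)
    \;=\; \sum_{i > m} \rho^{2i} c_i^{\,2}
    \;\le\; \rho^{2(m+1)}\,\chi^2(A, G),
  \qquad \rho = \cos\theta .
\]
```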
Very quickly, then. We know exactly the properties of the operator that we need: we know its eigenfunctions and its eigenvalues — the Hermite polynomials. So we can express A in the basis of these functions, and the expansion of A in this basis has no terms between degree one and m, just because A agrees with a Gaussian in the first m moments. From that it is easy to get a formula for the chi-squared distance between A and the Gaussian — it is the sum of the squares of these coefficients — and since we know exactly how the operator acts on this basis, we know what the noisy version of A is; finally we are able to bound the difference and show it is small. It is literally a bunch of very simple calculations once you understand how the operator acts, and that is the lemma.

Now, given this type of lemma, it is very easy to actually prove our final goal, which is the proposition that gives the SQ lower bounds. The idea for that, again as I said, is that we just need to construct a large set of distributions that are nearly uncorrelated — for the notion of correlation that the people before us defined — and that suffices. So there are two main ingredients. The first thing you need to show is that if you take two nearly orthogonal vectors v and v', then the correlation between P_v and P_{v'} is small. The second is a packing argument: you need to show that you can actually pick exponentially many vectors on the unit sphere that are pairwise nearly orthogonal — and this is a standard fact, provable by taking random vectors. So the real issue is to prove the correlation lemma, and I claim that the correlation lemma is a very easy consequence of the key lemma relating the projection of P_v to the standard Gaussian. In particular, the correlation turns out to be one-dimensional: it is exactly the correlation between A and the noisy version of A, and by Cauchy–Schwarz we can bound it by essentially what we had before — the closeness of the noisy version of A to the standard Gaussian — and applying the key lemma gives the answer. So that's it.

OK, so in the remaining three or four minutes I'm going to give you one application: how do we apply this general framework to learning GMMs — what is the A that we need, and what is m, the number of moments that we match in these instances? This is the SQ lower bound for learning k-GMMs: any SQ algorithm requires at least this many queries. Using the generic proposition — look at its two conditions — what we need is to make m something like 2k − 1 and δ something like a constant, and the way we're going to do that is by proving this lemma: there is a one-dimensional distribution which is a mixture of k Gaussians and agrees with the standard Gaussian in the first 2k − 1 moments. It is important that this is 2k − 1 and not something like 10k: if a mixture matched many more moments than this, it would actually have to be close to the Gaussian, and the whole thing wouldn't make sense. Moreover, the components are separated, and we still have the moment matching. So how do we construct such an A? Here is the core step numerically, and then the picture.
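The discrete, moment-matching core of this construction is easy to reproduce numerically: the Gauss–Hermite quadrature nodes and weights give a k-point distribution whose first 2k − 1 moments agree exactly with N(0,1). A minimal check (the smoothing into an actual mixture of Gaussians, described next, is omitted here):

```python
import numpy as np
from numpy.polynomial.hermite_e import hermegauss

k = 5
nodes, weights = hermegauss(k)      # probabilists' Gauss-Hermite quadrature
weights = weights / weights.sum()   # normalize into a k-point distribution

# k-node Gaussian quadrature is exact for polynomials of degree
# <= 2k - 1, so these discrete moments match N(0,1) exactly.
for m in range(2 * k):
    discrete = np.sum(weights * nodes**m)
    gaussian = 0.0 if m % 2 else np.prod(np.arange(1, m, 2, dtype=float))
    assert abs(discrete - gaussian) < 1e-8, (m, discrete, gaussian)
```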
So this is the picture. We basically use Gaussian quadrature to construct, first, a discrete distribution which matches moments with the standard Gaussian, and then we convolve it with a skinny Gaussian to smooth it out into a genuine mixture of k Gaussians. That is the construction. Given this one-dimensional A, the high-dimensional distributions P_v for this example look like "parallel pancakes": you literally have many Gaussian pancakes next to each other; the direction v is the line along which the means vary; all the eigenvalues of each covariance are one except one, which is small — basically the same as the variance of the one-dimensional Gaussians in the one-dimensional construction, something like 1/k², so there is no dependence on the dimension. And this is what we observe samples from.

I should say that for k equal to two this is solvable by spectral techniques — you don't need to look at higher moments; this is the result of Brubaker and Vempala — but as you increase k, and you match the moments in this way, you actually make the problem intractable, at least in the number of components. All right. There are many extensions of these results to various problems. In addition to the ones I showed, we have a statistical-computational tradeoff for SQ algorithms, and a bunch of sample complexity lower bounds for testing that basically show that robust testing becomes at least as hard as learning, information-theoretically. I'm happy to discuss these offline with anyone who is interested.

OK, so I'm going to skip all this material. What is the punch line? I gave you a generic technique to prove SQ lower bounds for a family of high-dimensional unsupervised estimation problems. The intended conceptual message is that SQ algorithms are a family of algorithms for which we can prove unconditional lower bounds — and in fact these lower bounds come with explicit instances. So if you don't believe that SQ is strong, just try these instances: these are not reductions, they are explicit instances, so design an algorithm that does better on them and you will have broken SQ. In terms of higher-level messages, we showed that robustness can be hard, both computationally and information-theoretically — it genuinely makes a difference for learning. As for future directions: the less inspired open question is to find further applications of this framework — we have some new ones. The other kind of direction is that we need to understand the power of SQ. Right now everything is based on empirical observation; we don't have formal evidence that this model is strong, just the fact that our algorithms fall into this category. It would be interesting to find alternative evidence of computational hardness for these types of problems — average-case problems, where such evidence is very scarce. NP-hardness is a no-go here: you would need to start from some average-case assumption, and that doesn't seem to exist as a theory. So we really need a deeper understanding of reducibility in unsupervised learning, and at this point I'm going to conclude.

[Q&A] — I mean, these are distributional problems, right? You cannot, typically, start from worst-case assumptions and prove hardness for average-case problems — or at least it is beyond what we currently know how to do.
— So the only reduction that I know of for a distributional problem is the basic reduction from planted clique. So, starting from some worst-case problem: we do believe these problems are hard, and in fact we have evidence that they are hard — but it is evidence conditioned on some other average-case problem being hard. I don't know of a reduction showing that a distributional problem like this is actually NP-hard; that just seems difficult to get.

— I mean, this is an oracle model, right? So the algorithm is a decision tree: you decide on a query, you get an answer, based on the answer you decide on your next query, and so on. Because it is a decision tree, we can actually prove lower bounds on the number of queries and on the accuracy that you need. And no, it does not capture an algorithm that works on the individual samples — for example, one that takes the samples, keeps, say, every fifth one, and then does Gaussian elimination on the rest.

— Right, so, for example — the model is powerful as an empirical observation, but it is not complete. There are things it cannot do: in particular, the one thing we know we cannot do in it is Gaussian elimination over finite fields. But for these continuous problems, solving a linear system over F₂ is not necessarily what you want to do, so maybe the model is strong enough. In particular, unconditional lower bounds — if you don't make some assumption or restrict your computational model — are crazy hard; we have no idea how to prove them.