[00:00:05] >> It's my pleasure to introduce today's speaker. He got his Ph.D. from Rutgers, and he has worked on many different topics. [00:00:28] A lot of his work is on connections between differential privacy and discrepancy, which is quite interesting, but today he's going to tell us about factorization mechanisms. >> Thank you, it's a pleasure to be here, and to give the first talk of the semester, I guess. I'll talk about [00:00:50] differential privacy. This is joint work with my student Alex Edmonds and with Jonathan Ullman, who is at Northeastern; it's very recent work. I will not assume that you know anything about differential privacy, and I'll start from first principles, so hopefully it's fun for everyone. There will be some optimization in there as well, which I think should be fun. OK, so let me start with [00:01:19] the plan for the talk. I'll start with an introduction to the local model of differential privacy; you'll see what that is in a minute. So let me start with what's probably the most basic question in private data analysis, which is: how do you collect data, or how do you collect statistics about people, when the statistics can be about sensitive information that people may not want to reveal to you? So maybe let's try this, let me ask you this question: how many of you have cheated on an exam? Raise your hands. [00:02:03] Some people volunteer. I'm going to guess this is an underestimate; there is some bias here. Richard has cheated a lot, so he puts his hand up. So that's, I guess, the utilitarian worry about asking a question like this: people may not want to answer. If I give them the option to not answer, they probably will not answer, [00:02:33] and if you don't give them the option not to answer, for example I think by law everyone must answer the census, then they may just lie if they're worried about how their sensitive information will be handled. Of course there is also a less utilitarian worry, that you're just disclosing people's private information, [00:02:55] and you may not want to do that just for, I guess, moral reasons as well. So this is a problem that has worried statisticians who deal with data about people for a long time, and already in the sixties Stanley Warner thought about this and proposed a simple [00:03:19] possible solution to this problem, or a possible way to make this problem not as bad. What he proposed was to give respondents the chance to randomize their responses. Again, this was the sixties, so the way he thought of this was that everyone would get some sort of device; think of it like a dial.
[00:03:43] Imagine I'm the interviewer and I come to you and ask you this question, but I give you this device and you hold this dial; I don't see the face of the dial. Then you press a button or something, the needle turns, and it ends up in one of two regions, which tell you either to lie or to tell the truth. The idea is that with some probability the needle ends up on truth and with some probability it ends up on lie, and the probability of ending up in the truth region is slightly higher, more than a half, than the probability of ending up in the lie region. [00:04:19] Then the hope is that when you see where the needle points, you'll actually do what the needle tells you to do. So let's give this to Richard: he has cheated on an exam, but the needle points to lie, and then he's going to tell me, no, I've never cheated. [00:04:40] And the point is that this gives you some plausible deniability, because as the interviewer I don't see where the needle was pointing. I don't know whether the answer you gave me is the truth or a lie, and you have this plausible deniability: whatever you say, I can always think that there's some reasonable chance you said what you said because the needle pointed to lie, and I cannot really be sure why you told me what you told me. So this was Warner's solution, and he did a little bit of simple statistical analysis, which we'll also see in a moment, to see how to use this kind of idea. Of course, these days you don't get a dial; the device can be just a piece of software, say on your cell phone or in your browser, and in fact iPhones [00:05:35] nowadays do this with some of your data: they spin a mental dial in there and report some of your data with randomized response. So this is something that's actually being done, in software. Does this make sense? Let me actually stop here for questions; I think we have enough time for questions.
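(A minimal software sketch of the dial just described, assuming the truth probability p is a parameter; the function name and the value 0.75 are mine, purely for illustration.)

```python
import random

def randomized_response(true_answer: bool, p: float = 0.75) -> bool:
    """Spin the 'dial': with probability p report the truth, otherwise lie."""
    if random.random() < p:
        return true_answer       # needle landed in the 'truth' region
    return not true_answer       # needle landed in the 'lie' region

# Example: a respondent who has cheated answers the survey once.
print(randomized_response(True, p=0.75))
```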
So let's dig a little bit more into why this sort of thing works, and when I say "works" there are two components: why does it give you privacy, and why does it still allow you to learn things about the population as a whole? Maybe you asked the question on the previous slide because you wanted to know what fraction of Georgia State students and professors have cheated on exams. If you just want to know this statistical information, [00:06:24] the hope is that this randomized response technique still allows you to learn something about this fraction. So let's first think about why it gives you privacy. I already talked about this, but let's be a little more precise. Let's look at the probability distribution [00:06:45] of your answers, assuming that you actually do what the dial tells you to do, in the case where you have cheated in the past and in the case where you have not. This is the distribution if you have cheated, and this is the distribution if you haven't. Intuitively, the reason this gives you privacy is that whatever answer you give me, the likelihood that you gave me that answer is not that different whether you are a cheater or not. Imagine p being something which is [00:07:14] a little bit bigger than a half but not very far from a half, so bounded away from 0 and 1; then p and 1 minus p are not that different. Think of this as one half plus epsilon and this as one half minus epsilon. So the likelihood of you giving me the answer that you gave me, [00:07:33] when you are a cheater and when you're not a cheater, is about the same, so I cannot have much confidence about what the truth is about you. OK, so this is intuitively why it gives me privacy: from your answer I cannot infer much about whether you have cheated in the past or not. Does that make sense? On the other hand, if we do this with many, many people, the aggregate statistical signal still comes through; on average it's still going to come through, at least if p is some constant bounded away from one half. And it's very simple to do the statistical analysis; it's at the level of a not very hard exercise. [00:08:20] For example, if we take Y to be the number of people who answered yes to my question, so "yes, I have cheated in the past", and if n is the total number of people that I have asked, then Y is a biased estimator of the number of people who have cheated, even if everyone is truthful. But it's easy to correct for the bias: I can compute the expectation, subtract the bias component, and rescale, and I get this sort of expression. Don't focus on what it is exactly; the point is that just knowing Y, n, and p, I can get an unbiased estimate of the fraction of cheaters among all the people that I have asked. Then I can also easily compute the variance of the same thing, and the main point here is not the exact expression but that there is an n in the denominator, so the variance goes to 0 linearly with n. So if I ask enough people, the variance will eventually be small enough that I can have some reasonable confidence that my estimate is not too far from the actual fraction of cheaters in the population, [00:09:27] assuming that, because of the privacy guarantee I gave them, they now answer truthfully and don't lie to me. Does that make sense? So this is basically what Warner did in '65.
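(A sketch of the debiasing step just described: if each person tells the truth with probability p and the true fraction of cheaters is f, the expected fraction of "yes" answers is (1 - p) + f(2p - 1), so inverting gives an unbiased estimator whose variance shrinks like 1/n. The simulation and parameter values below are mine, just to illustrate.)

```python
import numpy as np

def estimate_fraction(Y: int, n: int, p: float) -> float:
    """Unbiased estimate of the true fraction of 'yes' from randomized responses.

    E[Y] = n * ((1 - p) + f * (2p - 1)), so solve for f.
    """
    return (Y / n - (1 - p)) / (2 * p - 1)

# Quick simulation: true fraction 0.3, n people, truth probability p = 0.75.
rng = np.random.default_rng(0)
n, p, f_true = 100_000, 0.75, 0.3
truths = rng.random(n) < f_true                   # who actually cheated
tell_truth = rng.random(n) < p                    # what the dial says
answers = np.where(tell_truth, truths, ~truths)   # reported answers
print(estimate_fraction(answers.sum(), n, p))     # close to 0.3 for large n
```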
Let me now tell you about a more modern framework that tries to capture [00:09:48] the class of algorithms that give these randomized-response-style guarantees of privacy and accuracy. So this is a model that generalizes this idea of Warner's, and it's called local differential privacy. It's a little complicated to track down the right attribution for the definition, but for what I'm presenting to you [00:10:11] I think it's fair to say the first work in this model was by Kasiviswanathan, Lee, Nissim, Raskhodnikova, and Smith. So first let me be a little more precise about what data is going to mean for me. I'm going to think of data as [00:10:28] basically a table, but this table will never be stored in one place; if you imagine the data in aggregate, it would be some sort of table like this. You should think of every data point as a bunch of values for a few different attributes, so every row in this table is the data about one person, and maybe the kind of information about the person that we care about is, let's say, gender, education level, and age. It will be convenient to talk about the data universe, or, if you have taken a databases course, you would call this the schema of the data. The data universe, which I denote by this [00:11:12] fancy-looking letter, is just the set of all possible data points there are. In this case it is just all possible settings of the 3 attributes that I have, but in general this would be the universe of all possible data points.

[00:11:37] In this local setting the data is never all in one place; instead, every person holds their own data point in their hands and they don't give it to anyone, so you never aggregate all the data. Instead, you try to do what randomized response does, which is to let people tell you something from which you can infer things about the data as a whole. OK, so the computational model. [00:12:01] I haven't defined the privacy guarantee yet; let's first fix the computational model. The model is that every person holds one data point, and I use this capital X for the collection of all n data points, where n is the number of people that I'm collecting information from. So every person holds a data point, the data point is one point in the universe of possible data points, and the computational model is very simple: every person applies a randomized algorithm to their data point. [00:12:33] I'll call this randomized algorithm R_i for person i; they apply it to their own data point x_i to get a random variable z_i, which is a random function of their data point. They all send their z_i's to the server, and the server does some post-processing and outputs the final answer. So this is the computation. For example, going back here, [00:13:05] the z_i's may be just randomized yes-or-no answers to the question "have you ever cheated", and for the final post-processing, let's say what the server eventually wants to output is some estimate of the fraction of cheaters, so this final function may just be subtracting the bias and rescaling. [00:13:25] Usually this final processing will be very simple, some sort of rescaling and bias-subtracting kind of thing. And I guess a small technical detail: if it helps, these parties can all share some randomness; [00:13:49] that sometimes helps. That's the computational model.

I haven't said anything about privacy, so the next piece of this definition is the privacy requirement. What we're going to use here is the very successful framework of differential privacy, which, you could say, formalizes what I already said about randomized response and generalizes it. [00:14:14] All right, so what was the thing that made randomized response arguably private? It was the fact that anything you output is roughly equally likely no matter what your input was, and local differential privacy, building on the definition introduced by Dwork, McSherry, Nissim, and Smith, requires that each of these functions R_i that tell you what you send to the server satisfies the following. Again, these are randomized algorithms, and we require that for any two possible inputs x and x' [00:14:47] and any possible event in the output space of R_i, the probability of this event under input x and the probability of the same event under input x' should be about the same: the ratio between those two probabilities should be between e to the minus epsilon and e to the plus epsilon. You should think of epsilon as being small, like a small constant, [00:15:10] 0.1 or 0.01 (this is very optimistic), so this is then roughly one minus epsilon and this is roughly one plus epsilon.
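(A quick numerical check of this definition, written by me and not from the slides: randomized response with truth probability p = e^eps / (1 + e^eps), which is about one half plus eps/4 for small eps and within a constant of the "one half plus epsilon over two" mentioned in the talk, keeps the output probability ratios within e^(±eps).)

```python
import numpy as np

def rr_distribution(cheated: bool, eps: float) -> np.ndarray:
    """Output distribution over ('yes', 'no') for randomized response.

    Truth probability p = e^eps / (1 + e^eps), roughly 1/2 + eps/4 for small eps.
    """
    p = np.exp(eps) / (1 + np.exp(eps))
    p_yes = p if cheated else 1 - p
    return np.array([p_yes, 1 - p_yes])

eps = 0.1
d_cheater = rr_distribution(True, eps)
d_honest = rr_distribution(False, eps)
ratios = d_cheater / d_honest
# Every output is at most e^eps times more likely under one input than the other.
assert np.all(ratios <= np.exp(eps) + 1e-12)
assert np.all(ratios >= np.exp(-eps) - 1e-12)
print(ratios)
```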
So this formalizes the kind of privacy guarantee that I was talking about a minute ago, and it's an easy exercise to see that randomized response satisfies this for small epsilon, with the probability of telling the truth set to roughly one half plus epsilon over two. Does this make sense? Questions? [00:15:56] [Audience question about typical values of epsilon.] Right, so in the settings where people have implemented this in practice, epsilon is usually larger. I'm focusing on small epsilon in what I'm doing, but anything I say can be translated; the math is just a lot nicer with small epsilon, [00:16:24] and maybe some things would require a little more work. [Audience question about data points that are numbers rather than bits.] Yeah, so think of the x_i's as numbers, bigger numbers. In general, differential privacy has trouble unless the range is bounded. If the range is bounded, you can adapt randomized response in some way. [00:16:58] If the range is not bounded, then you have trouble; often you can even prove lower bounds if the range is unbounded, unless you somehow redefine what you want to compute. So if the range is not bounded, you shouldn't really compute the mean; you should probably compute the median, and then you can say things. [00:17:16] Differential privacy, maybe not formally but at least informally, seems to have some connections to robust statistics, and things that are not robust are also not easy to do with differential privacy. The other direction is not as clear, but it seems like you can often take robust things and make them differentially private. So in that sense, unbounded ranges can be problematic. [00:17:46] [Audience question about what epsilon gives good privacy.] Yeah, this is getting more into practical questions. Small epsilon makes a lot of sense if you actually want to give a reasonable privacy guarantee; this privacy guarantee gets more and more meaningless as epsilon gets larger. [00:18:03] What seems to happen in practice is that companies want to be able to say that they guarantee some privacy, but there is some hard requirement on how much information they also want to extract from the data, so then often epsilon is set large enough that you can get useful information. [00:18:21] It's a question of what your priorities are, I guess, but if you want real privacy, epsilon should be small, one or smaller, say. There was a question in the back, maybe. [Audience question about epsilon equal to zero.] Yes, basically, in this randomized response case p would be exactly a half and you would basically be giving a purely random answer, a uniformly random answer. [00:18:58] You could just forget whether you ever cheated or not and keep giving a random answer; that is very private. Was there one more question? OK. So that's a good segue to the next slide, which is: we also want to extract some information. If we really just wanted to achieve privacy, we could output pure random noise and then everything would be perfectly private, but of course we would extract no information about the data. [00:19:32] So let's try to also formalize some of the things that we may want to compute from the data, and start talking about how well we can compute them.
Towards the end of the talk I'll talk a little bit about learning in the PAC model as well, but for most of the talk I'll focus on something very basic, which is counting things, which is what the cheaters example was as well. [00:20:00] So I'll talk about counting queries, which are just queries to the data that ask how many, or what fraction, of the data points satisfy some property. The example so far has been the property "has cheated", but you can have other properties. In general I'll call this a predicate, which is just some function that answers yes or no, some binary function on the universe of possible data points; binary meaning it maps every data point to 0 or 1, where 0 means it does not satisfy the property and 1 means that it does. Another possible predicate, for example, with this kind of data schema, could be "is the first attribute, the gender attribute, equal to male?", and then if and only if you have a data point with M in the first field you get a 1, otherwise you get a 0. You can also consider weighted things, but I'll mostly talk about plain counting queries. [00:21:06] Then I'll overload notation a little bit: q is the predicate, but I'll also write q(X), where X is the dataset, and this will be just the fraction of points in the dataset that satisfy the predicate. So for this predicate, q(X) will be the fraction of males in the whole dataset. This is one kind of thing that we want to compute. [00:21:29] It's very basic, but it's very useful: governments, companies, and researchers want to compute counts of things all the time. This is basically what the census releases every 10 years; it's mostly counts of things. It's also a useful building block for more complicated things; you can implement any SQ (statistical query) algorithm using [00:21:52] queries like this, and we'll talk a little bit about using it as a building block towards the end. But I think it's a reasonable thing to try to compute from your data, at least to understand things at first.
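(To make the predicate notation concrete, a tiny illustration; the toy data and function names are mine, not from the talk.)

```python
def q_male(x: tuple) -> int:
    """Predicate: is the first attribute (gender) equal to 'M'?"""
    return 1 if x[0] == "M" else 0

def answer_counting_query(q, X) -> float:
    """Overloaded notation q(X): fraction of data points satisfying the predicate."""
    return sum(q(x) for x in X) / len(X)

# Toy dataset with attributes (gender, education, age).
X = [("M", "BSc", 34), ("F", "PhD", 29), ("M", "MSc", 41), ("F", "BSc", 23)]
print(answer_counting_query(q_male, X))  # 0.5
```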
So now that we have counting queries: basically, what we saw with randomized response is the following. [00:22:15] We talked about a particular counting query, but for any counting query we can run the randomized response mechanism that we have already talked about. We set p, the probability of telling the truth, to be roughly one half plus epsilon over two, so that we get epsilon-differential privacy, and then after subtracting the bias and rescaling we get some estimate, let's call it Z, so that the expected error, the expected absolute difference from the truth, is alpha, as long as you have enough data. And how much is enough data? It turns out that for error alpha you need n to be at least one over epsilon squared alpha squared. [00:22:55] If the data is just sampled i.i.d. from some underlying distribution, then to get error alpha you usually need this much data, one over alpha squared, and if you also want privacy, you additionally need the one over epsilon squared factor. This is again for small epsilon; [00:23:16] the math is easier for small epsilon, so let's say I focus on small epsilon for the whole talk. Right, OK. And for a single counting query it turns out this is optimal; this was known already in 2008, in a nice paper that has other results as well, by Beimel, Nissim, and Omri. [00:23:40] So randomized response is optimal for a single counting query; we've understood one counting query, great, so that research project is done. Of course, usually you don't want to compute just one counting query; usually you're interested in counting at least a bunch of things, or, if you're using counting queries to implement something more complicated, a single counting query is not going to be enough. [00:24:05] So what we'll mostly talk about is collections of counting queries, which I'll call query workloads, borrowing terminology that has been used a lot in the database community working on differential privacy. A query workload is just a collection of counting queries, and I'll again overload notation a little bit: I'll use this calligraphic Q for the collection of counting queries, but when I write Q(X) this will be the vector of true query answers on the dataset X. [00:24:44] Now of course you can ask how to adapt the only mechanism we know so far, which is randomized response, to a workload of queries. There is a very simple way to do this, which maybe was not the first way people thought about, but there is a very simple way to take this randomized response mechanism and use it for K counting queries, and it is just this: [00:25:05] basically, randomly partition the people in the dataset and assign each person to one of the queries, and that person is in charge of answering that query. In other words, I go to every person, I first pick one of the K queries at random, and then I ask them to answer that query for me with randomized response. That way I have about n over K data points that are useful for every one of the queries; in expectation I have n over K data points per query. [00:25:39] So I expect to need about K times more data to answer all K queries with the same error as one query, and in fact that's almost true, but you pick up an additional log K from a union bound somewhere. So let me [00:26:00] do this a little more slowly. By running this mechanism you get some output Z, which is now a vector of K numbers, which are the estimates of the K query answers, and if I take the expected maximum error, the expected maximum difference between what I'm outputting and the truth in absolute value, this expected max error is at most alpha as long as I have enough data, and enough data is roughly K times more than I needed for one query, with an additional log K because I needed to take care of a union bound somewhere. [Audience question: why not use every data point for every query?] The real problem there is privacy: if you do that, [00:26:51] if we use all data points for all the queries, I would need to divide epsilon by K with the analysis that you're suggesting, and then, because of the epsilon squared, I get a K squared. There is a more sophisticated way to analyze what happens if you reuse data points, [00:27:14] a more sophisticated composition, but since I haven't talked about composition, I'll leave it at that.
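(A sketch of this "assign each person one random query" variant of randomized response, with the per-query debiasing done by the server; the parameter choices and toy workload are mine, purely illustrative.)

```python
import numpy as np

def rr_workload(X, predicates, eps, rng):
    """Each person answers one uniformly random query with randomized response;
    the server debiases and rescales per query."""
    K = len(predicates)
    p = np.exp(eps) / (1 + np.exp(eps))      # truth probability for eps-LDP
    sums = np.zeros(K)
    counts = np.zeros(K)
    for x in X:
        j = rng.integers(K)                   # which query this person answers
        truth = predicates[j](x)
        report = truth if rng.random() < p else 1 - truth
        sums[j] += report
        counts[j] += 1
    # Debias: E[report] = (1 - p) + truth * (2p - 1).
    counts = np.maximum(counts, 1)            # avoid division by zero
    return (sums / counts - (1 - p)) / (2 * p - 1)

rng = np.random.default_rng(1)
X = rng.integers(0, 2, size=(50_000, 3))      # 3 random bits per person
preds = [lambda x, i=i: int(x[i]) for i in range(3)]   # one-way marginals
print(rr_workload(X, preds, eps=0.5, rng=rng))          # each entry close to 0.5
```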
OK, and now we can ask: is this very simple-minded thing the best you can do? And in general, it is. Maybe this was known earlier, maybe not, I'm not sure, but it was definitely proved [00:27:37] in a later work that there exists some workload of queries for which you cannot do better, including the log factor. So basically, up to the constants, there is some workload for which, if you want error alpha, expected maximum error alpha (by the way, is everyone fine with this notion of error? I know I could ask for bounds that hold with high probability, but that would mean one more parameter, so I'm doing it this way so that I don't have one more parameter), [00:28:05] they show that if I want expected maximum error alpha, then there exists some workload for which you cannot do better than this. But of course this is not true for all workloads. Basically, if you have a single counting query it doesn't really matter what the counting query is, but when you have many, it really matters how the different queries relate to each other in some sense. As a simple example (maybe you have a better example in mind, but as a simple example), if my workload has the same query repeated K times, of course I can estimate it with one over epsilon squared alpha squared data points and error alpha; if you have just the same query over and over again, it doesn't require this much data. [00:28:54] So now, how much data you need to get error alpha really starts to depend on exactly what workload you have. [Audience question about what the hard workload is.] It's quite easy to construct: you can imagine every person has K bits and every query asks about the person's i-th bit. [Audience question: how are people assigned to queries in this mechanism?] [00:29:37] Yes, so every person is randomly and independently assigned to a query; each person is assigned a random query independently of the other people, and then runs randomized response independently. You have to be careful about what privacy means here, but let's say the privacy definition should hold over the random choices of the person: if you fix all the random choices of the other people, your privacy should hold with respect to your own random choices. Does that answer your question? Other questions? All right. [00:30:43] OK, so what am I getting at with this? We sort of understand the worst-case behavior for counting queries, but I want to get a more fine-grained understanding: given a workload Q, what is the best that I can do for that workload? We've seen that for some workloads this bound is the best I can do, but there is at least one trivial workload for which it's not, probably the most trivial workload of K queries. [00:31:13] I want to understand the space in between. So this gets me to stating our results. First I need a little bit of notation to formalize the notion of error that I have been talking about, just to make the error statements more concise. I'll measure error by what I'm calling the worst-case error of a protocol Pi, [00:31:39] and this is the maximum over all datasets of the expected worst-case error over all queries. What I mean by this notation is: the mechanism takes the workload and the dataset, this is the public input and this is the private input, and I'm using the subscript q to mean the answer I get in the final output for query q; then I take the expected maximum error over all the queries. So this is what I'm calling the error, and if we fix the workload Q and the protocol Pi, the error is a function of n;
again, in general you would hope that the error goes down with n. [00:32:23] The inverse function of that is what I call the sample complexity: the sample complexity inverts this relationship and asks what is the smallest dataset size for which I can achieve error at most alpha. This is the way I have been stating the bounds so far. For example, what we have talked about so far is that the sample complexity, on any workload, of randomized response (the variant where we assign people to random queries) for error alpha is on the order of the number of queries times the log of the number of queries, over alpha squared epsilon squared. Questions? [00:33:12] Now, the reason to introduce this notation is to be able to talk about the optimal sample complexity, which is the minimum sample complexity over all locally differentially private mechanisms. So this is the best that I can do for a workload Q, and I want to match it workload by workload. [00:33:31] And this is what our result is: roughly speaking, up to some approximations, we show there exists an algorithm which is relatively simple (I hope I'll be able to define it) and also efficient, so it runs in polynomial time in the size of the input, which is the dataset and the queries. [00:33:53] It has the following properties. The sample complexity of this mechanism, which I call approximate factorization (that's where the AF comes from), on any workload Q and any error alpha is at most the optimal sample complexity times log K, but the optimal sample complexity at a slightly smaller error alpha prime, which is on the order of alpha over the log of one over alpha. So this is saying that however many samples this protocol needs to achieve error alpha, it is no more than a little bit more than what you need to achieve an error slightly less than alpha; it's a bi-approximation kind of thing. [00:34:39] The two kinds of approximation are a little bit annoying, but I don't know how to avoid them. Does the statement make sense? It's some form of approximate-optimality guarantee. [Audience question about how the protocol interacts with the data.] The protocol is going to do something with X, just like randomized response does something with X. [00:35:11] There's something that the person is supposed to do in the protocol; in the case of randomized response you're supposed to look at this dial, and the assumption is that you answer truthfully given that you have the privacy guarantee. So part of the protocol, in the case of randomized response with just a single query, is to look at the dial and report your answer or the opposite of your answer with the right probabilities; that is the way it interacts with X. [Audience question about what the error is measured over.] The way we define error is worst case over all possible datasets, so this is instance optimal in the sense that it's optimal for every workload, but the error is still measured worst case over all datasets of size n.
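(For reference, my reconstruction in LaTeX of the definitions just introduced; the exact formatting and constants are mine, not from the slides.)

```latex
% Worst-case error of a protocol \Pi on workload Q with n people:
\mathrm{err}(Q, \Pi; n) \;=\; \max_{X \in \mathcal{X}^n}\;
  \mathbb{E}\Big[\max_{q \in Q} \big| \Pi(Q, X)_q - q(X) \big|\Big].

% Sample complexity: the smallest n achieving error at most \alpha,
% and the optimum over all \varepsilon-locally differentially private protocols:
\mathrm{sc}(Q, \Pi; \alpha) \;=\; \min\{\, n : \mathrm{err}(Q, \Pi; n) \le \alpha \,\},
\qquad
\mathrm{sc}^{\mathrm{opt}}_{\varepsilon}(Q; \alpha) \;=\; \min_{\Pi\ \varepsilon\text{-LDP}} \mathrm{sc}(Q, \Pi; \alpha).

% Randomized response with random query assignment, K queries:
\mathrm{sc}(Q, \mathrm{RR}; \alpha) \;=\; O\!\Big(\tfrac{K \log K}{\varepsilon^2 \alpha^2}\Big).
```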
[00:36:21] [Audience question: does the protocol have to be private with respect to the workload?] No; the protocol takes the workload, and the workload is public information, so you don't have to be private with respect to that. The protocol has to be private with respect to the data, so I should have said: there exists a protocol which is epsilon-differentially private with respect to X, and it takes Q as input as well, and it does something that depends on Q. It has to, because the optimal thing to do for different workloads is different. Well, I'm not saying I really have an argument that it has to, but that's what I can do. [00:37:15] [Audience question: can the protocol fail?] Right, if you don't have enough data it may fail for small alpha. OK, other questions? OK. [00:37:38] There is also a similar result we prove for agnostically learning any concept class; if I have time, we'll talk about this a little bit. I should say that similar results are known for another model of differential privacy, which is the central or curated model, where all the data is in one place and you don't have this distributed interaction. There is a long line of work; these are just some of the citations. [00:38:01] In the central model, generally, the approximation factors are much worse: either the alpha prime is smaller than alpha by a factor of the log of the universe size, which can be very, very large because the universe can be very, very large, or you have something here which is not just log K but maybe log K over alpha, so if alpha is small you get a much worse bound. So this is just to compare with prior work. Questions about the results? So in the remaining time I want to try to give a sense of what the protocol looks like, and hopefully you will see how it does something based on the queries, which it has to. [00:38:45] I do have to introduce a little more formalism to be able to talk about the protocol, but it's all very natural, I think, so keep asking questions if anything is unclear. It will be convenient to have a linear-algebraic way to talk about databases and queries. The first step is to represent a database as a vector, and it's a very natural vector; we call it a histogram, but it's basically just the vector that gives the empirical distribution of the dataset. [00:39:16] It's a vector that's indexed by the universe, and for any possible data point x, the entry corresponding to x in the histogram h is just the fraction of points in the dataset that are equal to x. It's really just the vector representation of the empirical distribution that corresponds to the dataset [00:39:39] X. As an example, let's say my data universe is 3 binary attributes, so every data point is just 3 bits, and let's say I had this dataset of 6 data points. (Something happened with the slide; what was supposed to be in red is the histogram, and that didn't render, I'm sorry about that.) What I meant to say is: this is the histogram that you get out of this dataset. For example, I have 2 instances of 001, one here and one here, [00:40:16] and therefore the entry corresponding to 001 is 2 out of 6. I hope that makes sense; questions about this?
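(A small sketch of the histogram representation on the 3-bit universe; the six-point toy dataset below is my own, chosen so that, as in the talk's example, the entry for 001 comes out to 2/6.)

```python
import itertools
import numpy as np

universe = list(itertools.product([0, 1], repeat=3))     # all 3-bit data points
index = {x: i for i, x in enumerate(universe)}

def histogram(X):
    """Vector indexed by the universe; entry x = fraction of points equal to x."""
    h = np.zeros(len(universe))
    for x in X:
        h[index[tuple(x)]] += 1
    return h / len(X)

X = [(0, 0, 1), (0, 0, 1), (1, 0, 1), (1, 1, 0), (0, 1, 1), (1, 1, 1)]
h = histogram(X)
print(h[index[(0, 0, 1)]])   # 2/6, as in the example above
```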
The other piece of introducing some linear algebra into this is to represent the workload as a matrix, [00:40:38] and you can maybe see where this is going. There's a very natural way to do it: this is a matrix whose rows are indexed by the queries and whose columns are indexed by the universe, and the entry of the matrix corresponding to a query q and a data point x is just the value q(x). In other words, the rows of this matrix are just [00:41:05] the truth tables of the queries, and really, if I'm going to be given the queries as input, this is the most natural input format in which to specify them. So again as an example, if I take the same 3-binary-attribute, 3-bit universe, and let's say the one-way marginals, which are a very basic class of queries, then in this case I have 3 queries, one corresponding to each of the 3 bits, and query i just asks what the i-th bit of the data point is. Then I get this workload matrix: you see the first row is always equal to the first bit of the corresponding column, the second row is always equal to the second bit, and the third row to the third bit. [00:41:56] So if you read off the rows in this case, they're just equal to these bit patterns. Questions? So why do I do this? One thing that's very convenient is that if I represent the workload as a matrix and the histogram as a vector, [00:42:19] then the answers to the workload queries are just given by the matrix-vector product. So now I can start thinking in a slightly more linear-algebraic way about the counting queries. There is a simple computation, which I've written here and won't go through; it's a simple exercise that the product Wh is just equal to the vector of true answers to the queries in the workload. [00:42:59] Good. With this, the other thing I have to give you before I specify any algorithms, and this is the last technical piece, is a more general, slightly more flexible way to answer queries than randomized response. Randomized response, at least the way I presented it, is not flexible enough to get these optimal results, so we'll use something else which is also simple but a little more flexible: a mechanism that adds Gaussian noise. [00:43:30] Among other things, this is a mechanism that can deal with numbers that are not 0 or 1, maybe to partially answer the earlier question. To define it, I have to define a parameter which parameterizes this mechanism, and it is some notion of how sensitive the queries are to any particular person in the data. This is a very natural thing to consider in differential privacy, because you're trying to hide every person's information, so it's natural to consider how much one person can influence your queries. This is the L2 sensitivity, [00:44:04] and it is basically the maximum L2 norm of one person's contribution to the queries. In other words, it is the maximum, over all possible data points, of the L2 norm of the vector of query answers that you get on that data point, or equivalently, since I already introduced the workload matrix, it is just the maximum L2 norm of a column of the workload matrix, since every column of the workload matrix gives you the answers to all the queries for one possible data point.
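(Continuing the same toy setup, a self-contained sketch of the one-way-marginals workload matrix, the identity that Wh gives the true query answers, and the L2 sensitivity as the largest column norm; written by me for illustration.)

```python
import itertools
import numpy as np

universe = list(itertools.product([0, 1], repeat=3))
X = [(0, 0, 1), (0, 0, 1), (1, 0, 1), (1, 1, 0), (0, 1, 1), (1, 1, 1)]

# Workload matrix for one-way marginals: W[i, x] = q_i(x) = i-th bit of x.
W = np.array([[x[i] for x in universe] for i in range(3)], dtype=float)

# Histogram of the dataset as a vector over the universe.
h = np.array([sum(tuple(p) == x for p in X) for x in universe]) / len(X)

# The matrix-vector product gives the true query answers (fraction with bit i set).
print(W @ h)
print(np.allclose(W @ h, [np.mean([p[i] for p in X]) for i in range(3)]))

# L2 sensitivity = maximum Euclidean norm of a column of W (here sqrt(3)).
print(np.linalg.norm(W, axis=0).max())
```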
[00:44:41] So by definition it is the maximum L2 norm of any one of these columns, and because I defined my queries so that they evaluate to 0 or 1, this is always at most the square root of the number of queries, but it can actually be smaller, and we'll see an example very soon. So this is some measure of how much a person can influence the query answers, and now using this I can define the Gaussian noise mechanism. This is a very natural thing, even older than differential privacy; it was analyzed all the way back [00:45:21] in the work of Dinur and Nissim. Here is the locally differentially private version of this mechanism. Remember, the way these protocols work is that every person sends something to the server and then the server somehow combines those answers. So what does every person now send to the server? They're going to look at their data point, [00:45:45] and one way to say this is that they're going to take the corresponding column of the workload matrix, add Gaussian noise to it, and send it to the server. Another way to look at it: you look at your data point, you get the vector of query answers that it gives on all the queries, then you add Gaussian noise to each one of these answers and send this to the server. And how much Gaussian noise do you add to every coordinate, to every query? You add Gaussian noise which is proportional to the L2 sensitivity; what's on the slide is the variance, so the standard deviation would be proportional to the L2 sensitivity of the workload [00:46:22] times some factor which depends on the privacy parameter. (There should be no delta here, I'm sorry.) This factor will be roughly one over epsilon, so the standard deviation will be roughly the L2 sensitivity over epsilon. Then the server just takes these noisy answers and averages them; you can see this is an unbiased estimator of the true query answers, and it's not very hard to compute [00:46:51] the variance and see how much error you get. And I should say the Gaussian distribution doesn't quite give you this pure form of differential privacy, but that's easy to fix, and I'll keep it this way to keep things simpler. In fact you have to fix something in the tails, but you can still use something Gaussian-like; it's easy to fix but it gets technical, so I'm not going to talk about it. Questions? So it's a very natural mechanism: it just says take how much one person can influence the query answers and add noise proportional to that.
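(A sketch of the local Gaussian-noise mechanism just described: each person sends their column of W plus Gaussian noise and the server averages. The constant in the noise scale, and the delta term a rigorous analysis would need, are deliberately elided; this is illustrative, not a calibrated implementation.)

```python
import numpy as np

def local_gaussian_mechanism(W, data_indices, eps, rng):
    """Each person sends their column of W plus Gaussian noise; server averages.

    W: (K queries) x (universe size) workload matrix.
    data_indices: for each person, the index of their data point in the universe.
    Noise std is proportional to the L2 sensitivity over eps; exact constants
    (and the delta dependence of a rigorous guarantee) are omitted here.
    """
    delta2 = np.linalg.norm(W, axis=0).max()          # L2 sensitivity: max column norm
    sigma = delta2 / eps                               # up to constants / log factors
    noise = rng.normal(0, sigma, size=(W.shape[0], len(data_indices)))
    reports = W[:, data_indices] + noise               # one noisy column per person
    return reports.mean(axis=1)                        # unbiased estimate of W @ h

rng = np.random.default_rng(2)
universe_size, K, n = 8, 3, 200_000
W = rng.integers(0, 2, size=(K, universe_size)).astype(float)
data = rng.integers(0, universe_size, size=n)          # everyone's (private) data point
h = np.bincount(data, minlength=universe_size) / n
print(local_gaussian_mechanism(W, data, eps=1.0, rng=rng))
print(W @ h)                                            # the noisy estimate is close to this
```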
[00:47:32] All right, so now let me use this mechanism to show you an interesting example where you can do better than basic randomized response. The example is threshold queries. Again, you cannot always do better; you can only do better for some queries, and this is one natural class of queries where you can do better than randomized response. These are a very simple class of queries: every person has a number between 1 and capital N, and every query asks, for some threshold t, what fraction of the people have their number below t. So what fraction of the people in the dataset have a number below the threshold; this is like computing the CDF of the empirical distribution. [00:48:20] So there are capital N queries, and, if you think about it, the workload matrix just looks like this lower triangular matrix of ones, where both rows and columns are indexed by 1 to capital N. For these queries randomized response would have sample complexity which is N log N, so roughly linear in capital N, but in fact you can do something polylogarithmic in N; the best you can do is on the order of log N to the third. So you can do a lot better than randomized response. Let me try to explain how; it's sort of cute, and if you have taught data structures, or the right kind of data structures course, this will probably be familiar. [00:49:05] [Audience question: why doesn't the Gaussian mechanism already do better here?] You always add noise which is proportional to the worst-case sensitivity, so if I just apply the Gaussian noise mechanism to this workload, I will not do any better than randomized response; it will do roughly the same, because everyone has to add noise proportional to the worst case. You don't know ahead of time [00:49:29] what data everyone has, and this is private information, so you cannot adjust for that. Good, so the idea will be to not answer these queries directly with the Gaussian mechanism; the idea will be to answer a different set of queries, and from those queries to reconstruct the answers to the original queries, and these different queries are defined [00:49:52] using a binary tree. Once I've said that, I think a few people are nodding; it's not that surprising. So what am I going to do? Let's assume N is 8; [00:50:13] in this case it is 8, but in general assume N is a power of 2, you can always assume that. Let's build a complete binary tree on the values 1 to N; then every node in the tree corresponds to one counting query, and I'll answer those counting queries. What are the counting queries? The leaves correspond to just asking how many people have a particular data value: how many people have the number 1, how many have the number 2, how many have the number 3, and so on, so I get N queries just for how many people have a particular value. Then on the second level, this query asks how many people have value 1 or 2, this one asks how many have value 3 or 4, and so on; in general, every node [00:50:56] asks how many people have a value corresponding to one of the leaves below it, so every node corresponds to a range of possible values, a range whose size is a power of 2. These are the tree queries. I'm going to answer these with Gaussian noise, and it's not hard to convince yourself that I can reconstruct any one of the original queries by [00:51:27] adding up some of the queries that correspond to the tree, using at most one query per level of the tree. So, for example, the answer to how many people have value between 1 and 7 is given by the number of people with value 1 to 4, plus the number of people with value 5 or 6, plus the number of people with value 7, and I can do this in general. So the point is going to be that the L2 sensitivity of these tree queries is small, and I can reconstruct the original queries by combining only a few of them. [Audience question about communication.] Right, I am not concerned about communication right now, so I'll just have every person use the Gaussian mechanism to answer all the queries [00:52:19] in the whole tree. [Audience question about privacy.] Yes, I'm claiming that the Gaussian mechanism applied to these tree queries satisfies the privacy property.
Yeah, you're going to hide that: what you're going to do is answer all the queries in the whole tree, adding noise to each one of them, and this will hide what value you have in particular. [00:53:34] And you can show this; yeah, I think that's a good way to see it. So let me describe what is happening here. One way to look at this is that I'm factoring my workload as the product of two things. It's not obvious that this is what is happening, but let me briefly try to convince you. [00:54:03] On one hand, I have the workload matrix, call it A, that corresponds to these tree queries, and I'm claiming that from the answers to these queries I can reconstruct the answers to the original queries by adding up the values of some of them. This is a linear transformation of the answers to the tree queries, so I can encode this reconstruction by another matrix R: I'm claiming there is a matrix R which encodes the linear combinations that I need to take in order to get the values of the original queries. If you think about it, what I've described here is a way to factor W as R times A, where A gives the queries that are answered with Gaussian noise, and R tells you how to reconstruct, from those answers, the answers to the original queries that I care about. So I'm going to apply the Gaussian mechanism to the queries encoded by A, the tree queries, and now the point is that any possible data value only influences the queries on the path from that value's leaf to the root, so the L2 sensitivity is bounded by the square root of log N. The L2 sensitivity is now small, whereas it was about the square root of N originally for W. So I can add noise on the order of square root of log N to every one of these tree queries, and then the server takes the noisy answers to the tree queries encoded by A and multiplies those answers by R to get an estimate of the original queries. And now the point is that [00:55:43] in R I never add up more than about log N things to reconstruct one of the original queries, and I can use this to compute the variance per query; it's polylogarithmic in capital N, and from there I just get another log factor for the union bound, and I get what I claimed. [00:56:11] [Audience question: does every person answer all the tree queries?] Yes, every person answers all the tree queries, but then for every one of the original queries I only take the values of the tree queries that are relevant to that query, so to reconstruct, say, how many people are below 7, R only adds up about log N of them. [Follow-up question.] Yes, which tree queries are relevant to a person is private information, so each person has to answer all of them, each one privately.
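(A sketch, written by me to match the description above, of the binary-tree factorization for threshold queries with N = 8: it builds the tree-query matrix A and the reconstruction matrix R, checks that W = R·A, and compares the relevant norms.)

```python
import numpy as np

N = 8                                     # universe {1, ..., N}, a power of two

# Threshold workload: W[t-1, x-1] = 1 iff x <= t  (lower-triangular ones).
W = np.tril(np.ones((N, N)))

# Tree queries: one row per node of the complete binary tree over {1, ..., N}.
sizes = [2 ** j for j in range(int(np.log2(N)) + 1)]
nodes = [(start, size)                    # dyadic interval [start, start + size)
         for size in sizes
         for start in range(0, N, size)]
A = np.zeros((len(nodes), N))
for r, (start, size) in enumerate(nodes):
    A[r, start:start + size] = 1.0

# Reconstruction: decompose each prefix [1, t] into at most log N + 1 dyadic blocks.
R = np.zeros((N, len(nodes)))
for t in range(1, N + 1):
    pos, remaining = 0, t
    while remaining > 0:
        size = 1 << (remaining.bit_length() - 1)   # largest power of two <= remaining
        R[t - 1, nodes.index((pos, size))] = 1.0
        pos, remaining = pos + size, remaining - size

print(np.allclose(W, R @ A))               # the factorization W = R A holds
print(np.linalg.norm(W, axis=0).max())     # ~ sqrt(N): worst column of W
print(np.linalg.norm(A, axis=0).max())     # ~ sqrt(log N + 1): worst column of A
print(np.linalg.norm(R, axis=1).max())     # each row of R combines <= log N + 1 tree answers
```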
OK, so this was the motivating example; let me try to abstract what happened into a more general mechanism that can do these kinds of things. What happened in general? Somehow, by insight in this case, though we'll see how to do it in a different way, I found a way to factor my workload matrix as the product of two matrices, some R times some A, and then [00:57:19] I take A to define new queries and apply the Gaussian mechanism to the queries defined by A. So my mechanism has every person answer the A queries, which were the tree queries before, [00:57:43] adding Gaussian noise to them, and then has the server reconstruct by multiplying by R. A little bit of math convinces you that the error I get out of this factorization mechanism is governed by the following two things: the L2 sensitivity of A, which is the maximum column norm of A, and the maximum [00:58:08] L2 norm of any row of R, which is what corresponded to adding up only about log N things on the previous slide. Then there are these other terms, like a root log K for the union bound, and you divide by epsilon and by root n. So the main point is that how much error you're getting, if you ignore those terms which don't depend on the factorization, is governed by this product: the maximum row norm of R times the maximum column norm of A. So of course what we're going to do is optimize over all possible factorizations to minimize this product: we take the factorization that minimizes this product here, and therefore we minimize the error. And this minimum, in fact, turns out to be efficiently computable; this minimization problem turns out to be efficiently solvable using an SDP, which is a nice exercise. [00:59:05] Moreover, this minimum turns out to be a norm on matrices; I'm not going to use that, but it's a nice, maybe surprising fact. It is called the gamma-2 norm of the matrix W. So this error formula can basically be inverted to get the sample complexity out of it, and the sample complexity is bounded by the gamma-2 norm squared times log K, over epsilon squared alpha squared. This is similar to something called the matrix mechanism. [00:59:36] OK, so let me see. Now, this is pretty general, so I can ask: is this the best that I can do? And maybe I'll speed up here a little bit, so let's ignore this. It turns out that it's not the best you can do: there are natural families of queries where you can do better than the factorization mechanism. [00:59:59] The most natural example is k-way marginal queries, where you have a dataset with d binary attributes, and every query selects k attributes and asks how many data points have particular values in those k attributes. Here the gamma-2 norm is on the order of d to the k over 2, so this is not much better than simple randomized response, but there is a mechanism that achieves sample complexity on the order of d to the square root of k, for constant alpha, so you can do a lot better than both randomized response and the factorization mechanism. [01:00:42] So what is the main idea of that algorithm? It uses polynomial approximations to achieve what is essentially a factorization again, but not of the original matrix: of some matrix that's approximately equal to the original matrix in every coordinate. So the trick to beating the factorization mechanism seems to be to factor not the original matrix but a matrix which is close to it in every entry of the matrix. [01:01:17] They do this using polynomial approximations, approximating the predicates by low-degree polynomials, but once you realize that that's what the polynomials achieve, you can make that your goal directly.
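(For reference, my reconstruction in LaTeX of the quantities just discussed: the error of the factorization mechanism, the gamma-2 norm, the resulting sample complexity, and the approximate variant that comes up next. Constants are omitted and the exact parametrization is my guess at the slide contents.)

```latex
% Factorization mechanism for W = R A: answer the A-queries with Gaussian noise,
% then reconstruct with R.  Up to constants,
\mathrm{err} \;\lesssim\; \frac{\|R\|_{2\to\infty}\,\|A\|_{1\to 2}\,\sqrt{\log K}}{\varepsilon\,\sqrt{n}},
\qquad
\|R\|_{2\to\infty} = \max_i \|R_{i,:}\|_2,\quad
\|A\|_{1\to 2} = \max_j \|A_{:,j}\|_2 .

% Optimizing over all factorizations gives the gamma-2 norm and the sample complexity
\gamma_2(W) = \min_{W = RA} \|R\|_{2\to\infty}\,\|A\|_{1\to 2},
\qquad
\mathrm{sc} = O\!\left(\frac{\gamma_2(W)^2 \log K}{\varepsilon^2 \alpha^2}\right),

% and the approximate variant factorizes any entrywise-close matrix:
\tilde\gamma_2(W, \alpha) = \min\{\, \gamma_2(\widetilde W) : \max_{q,x} |\widetilde W_{q,x} - W_{q,x}| \le \alpha \,\}.
```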
And this is our final mechanism, which is [01:01:34] based on the following observation: if you have some matrix W-tilde such that in every coordinate W and W-tilde differ by at most, say, alpha over 2, then for every histogram, the answers those two matrices give are going to be off by at most alpha over 2 in every coordinate. [01:01:52] So it's enough to approximately answer the queries given by W-tilde. What I'm going to do is just find a W-tilde that minimizes the gamma-2 norm and then use the factorization mechanism for W-tilde, and now I get this approximate factorization mechanism, which has almost the same guarantee, but the sample complexity is now bounded in terms of the approximate gamma-2 norm, that is, the minimum gamma-2 norm of any W-tilde that is within alpha over 2 of W in every entry, which can sometimes be a lot smaller. And the main theorem is that [01:02:35] the optimal sample complexity of any locally differentially private algorithm is also lower bounded by this approximate gamma-2 norm squared over epsilon squared alpha squared, up to the approximation factors from before. All right, let me move past this; sorry, the computer is acting up. This lower bound is proved by SDP duality, somehow, and I'll skip it, sorry about that. And we get similar results for agnostic learning, which basically say that the best way to do agnostic learning is to answer the corresponding counting queries that tell you how well each concept fits the data. You can ask me about this offline, and I'll stop there. Thanks. [01:03:44] [Audience question from Richard: would a different tree give a better factorization?] I don't know; the program will find the optimal factorization, so other trees would give other factorizations, and you might beat what I showed by a constant, but you can't beat it by more than a constant with a factorization approach. [01:04:25] There's still a log gap between this and what you can do with an arbitrary epsilon-locally-differentially-private algorithm, but within factorization-type mechanisms you cannot beat this, up to log factors. So I guess something else blows up when you take a different layout. It also feels like an exercise you'd give in a data structures class, but [01:04:50] you'd multiply by the branching factor minus one, I guess, over all your possible choices. [Audience question about the two levels of approximation.] Yes, that's a good question; there does seem to be some non-linearity happening there. You are losing information with the second approximation: [01:05:31] the final mechanism answers different queries, which are only guaranteed to give answers that are close to the original ones, so there are two levels of answering different queries. In a way, with the tree thing, you answer a different set of queries from which you can exactly reconstruct the original queries, so no information is getting lost, but you're doing this factoring to get lower sensitivity. [01:05:59] But then we're also saying, well, what if I don't insist on being able to reconstruct the original queries exactly but just approximate them and allow losing some information? So, for example, you may really lose information: you may end up with lower-rank queries, in some sense, from which you don't have enough information to reconstruct exact answers to your original queries, and that may help. [01:06:32] Thank you.