Okay, so I'm going to talk again about unfairness in algorithmic decision-making. It's going to be a breadth talk rather than a depth talk: it's three results, and I'm trying to squeeze all of them in, mainly because I don't know how to present a series of ever more complicated derivations of conditional probability in a talk and make it interesting. So I decided to keep to the intuitive picture and just give you a survey of some results. These are three different results, joint work with a number of people, most notably Jamie Morgenstern, who's not here.

So there have been lots of headlines of this kind lately: algorithmic decision-making holds a lot of promise in situations that are critical to individuals, such as hiring, where the hope is that algorithms, being objective, will avoid the biases humans bring to the task; lending by banks; and, even more seriously, predicting criminal behavior, recidivism and so on, based on data about individuals. Some of these articles are written in a very hyperbolic fashion about how the future is going to be fair and objective. Sentencing is being done in this manner, deciding how long sentences should be, and predictive policing is another application of these techniques. So on the one hand the hyped-up headlines say the world is going to be wonderful from now on because algorithms are going to take over. On the other hand there is the natural fear that big bad data may be triggering discrimination: algorithms trained on historical data, which reflects all the biases of society, just propagate those biases further, maybe even at a larger scale. The other problem is that these algorithms are black boxes that are very hard to understand: we don't know what methodologies they use, and it's hard to interpret the results they give or explain why somebody was treated some way by an algorithm. So those usual problems have also been pointed out. On one side there is hype, on the other a lot of hand-wringing about how algorithms are going to destroy us. As computer scientists, I guess our goal is to strike some middle ground, where we try to understand more objectively and mathematically what these algorithms do and do not do. So this talk, like much of this area, is aimed at setting up some basic definitions and proving some very basic facts, which are not directly applicable but lay the groundwork for further work in this field.

What's the talk outline? I'll briefly describe what decision-making and fairness issues are and give you background; then talk about one specific formulation, multi-armed bandits, which is relevant to two of the three results I present, because it's one notion of how to make decisions in an online fashion under uncertainty; and then go through the three results pretty quickly.

Okay, decision-making background and definitions. The first question to ask is what kinds of decisions we want to be fair about. Right at the top, if we're talking about algorithmic machine-learning decisions, we should distinguish online decision-making from batch decision-making. Online means that each new decision is affected by the data you've seen so far in the stream arriving over time.
In batch decision-making, you've already designed your algorithm based on some data you've learned from, and you now apply that same algorithm repeatedly to decide for a whole bunch of individuals. So: online versus batch decisions. It's in the online setting that the multi-armed bandits formalism is useful. In either setting we could be talking about the classification problem, which is probably the most important thing we want to be fair about; all the examples I showed you are classification, where the goal is to determine if a person gets a loan or doesn't, gets into college or doesn't. These are binary classification problems, yes or no. Slightly more complicated is an allocation problem, where you might have a whole bunch of medical treatments available and a whole bunch of patients, and you have to decide which patients to allocate to which treatment. Another example of allocation: a conference has a certain number of slots for talks, and it has to take the submitted papers and decide which to accept. The difference from classification is just that there's an overall budget on how many things you can classify as positive, as opposed to pure classification, where the individual decisions are independent. Which congressional district I'm assigned to is another example. And one of the results I talk about is something like pipelined classification, where the idea is not to consider one particular point where a decision is made and ask whether it's fair, but to follow an individual through life: they go through a series of such classifications, whether they get into college or not, and then, depending on whether they got into college, whether they get a job or not. So imagine a pipeline of classifiers and ask what it means for that to be fair. (Yes, I'm getting to that in a minute; that's going to be a big topic.)

So defining fairness is a hugely difficult task. Some of you may be familiar with the notion of privacy, and differential privacy in particular, which has become the de facto standard for privacy because it's a clean definition that seems to fit many circumstances. The picture for fairness is much messier and more complicated: not only is there not one definition yet, there cannot be one definition of fairness that fits all situations.

So why might machine learning be unfair? Let me talk about that before talking about the definitions. First, data might encode existing biases, and in fact perpetuate them even further. For example, if you use in your feature vector of observations about an individual the number of times they were arrested, rather than the number of times they were convicted, that's a self-perpetuating thing: people who have been arrested more will be arrested again, and maybe they are never convicted in all the times they're arrested. It's just bad data to use to decide whether to arrest someone. Second, there's the problem of one-directional feedback: only if you give a person a bank loan can you see whether that person repays it; you don't see it for the people you refuse the loan to. So you only get data for the people you're treating positively; a minimal simulation of this effect is sketched below.
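To make the one-directional-feedback problem concrete, here is a minimal, hypothetical simulation (the population model, cutoff, and numbers are my own illustration, not from the talk): a lender only observes repayment outcomes for approved applicants, so the data it collects systematically over-represents the population it already favors.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical population: a score that truly predicts repayment probability.
n = 10_000
score = rng.normal(0, 1, n)
repaid = rng.random(n) < 1 / (1 + np.exp(-score))  # true outcomes

# The bank only approves applicants above a cutoff, and only ever
# observes repayment for those approved applicants.
approved = score > 0.5
observed_repaid = repaid[approved]

print(f"true repayment rate (everyone):     {repaid.mean():.2f}")
print(f"observed repayment rate (approved): {observed_repaid.mean():.2f}")
# The observed data says lending is much safer than it is for the full
# population, and it contains no examples at all from the rejected group,
# so no model trained on it alone can tell whether the cutoff was right.
```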
Then, different populations may have different properties; I'll say more about that in a moment. We also have less data about minority populations: almost by definition, our models are less well trained on a minority population because there are fewer of them. And because of that, the burden of exploration may fall disproportionately on different populations.

Here's an example. There are two populations, and we're deciding whom to admit to college based on SAT score and, say, the number of credit cards. That sounds crazy, but think of it as a measure of wealth: wealthier people may employ SAT tutors, so the scores of people with more credit cards might be higher for that reason. So you have a green population and an orange population, and suppose you want to find a linear classifier for each. This is labeled data: the pluses are people who would succeed in college, the minuses are people who wouldn't. Pictorially, the best classifier for the green population is one line, and notice that if we use that classifier for the overall population, it denies all the qualified people from the orange population; it says no to everybody qualified from the orange population, whose correct classifier sits lower down. If you decide to use one classifier for the whole population, then just because of the numbers, there being so many more greens than oranges, you would go with the green classifier and be completely unfair to the entire orange population. The upshot is that we actually need to take into account the group membership of each person, orange or green, before deciding what classifier to use; being blind to group membership is not a solution for fairness. A minimal sketch of this picture is below.
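Here is a minimal sketch of the green/orange picture with made-up synthetic data (the group sizes, means, and shift are my own placeholder choices): fitting one linear classifier to the merged population serves the majority and rejects essentially all qualified minority applicants, while a group-specific classifier does fine.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(1)

def population(n, shift):
    """Qualified (+) and unqualified (-) students; `shift` moves raw scores down."""
    qualified = rng.normal([3 + shift, 3], 0.5, (n, 2))
    unqualified = rng.normal([1 + shift, 1], 0.5, (n, 2))
    X = np.vstack([qualified, unqualified])
    y = np.array([1] * n + [0] * n)
    return X, y

X_green, y_green = population(2000, shift=0.0)    # majority, higher raw scores
X_orange, y_orange = population(100, shift=-3.0)  # minority, scores shifted down

# One classifier trained on the merged population:
merged = LogisticRegression().fit(np.vstack([X_green, X_orange]),
                                  np.hstack([y_green, y_orange]))
# A classifier trained on the orange group alone:
orange_only = LogisticRegression().fit(X_orange, y_orange)

print("merged model, accuracy on orange:", merged.score(X_orange, y_orange))
print("orange-only model, accuracy:     ", orange_only.score(X_orange, y_orange))
# The merged model's decision boundary is set by the green majority, so it
# rejects essentially all qualified orange applicants (accuracy near 0.5,
# i.e. it gets only the unqualified half right).
```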
So the question is: can we design fair ML algorithms? And now we ask what fairness means. Very coarsely and broadly, the literature divides into individual notions of fairness and group notions of fairness. Let's talk about each.

Individual notions of fairness, for lending, college admissions, hiring, all these things. First, metric fairness, a notion due to Cynthia Dwork and others. The idea is to embody in a mathematical definition a principle laid down by the great philosopher John Rawls: similar individuals should be treated similarly. Think of individuals from here on as vectors of features; unfortunately we have to be reductive here, but each person is a vector. Take two individuals x and x', and let h(x) be the output of the machine-learning algorithm h on individual x, with 1 being a positive classification. The definition says the difference between the probability that the algorithm outputs 1 on x and the probability it outputs 1 on x' should be no more than the distance between x and x' in an appropriate metric space, some notion of how similar the two individuals are: |Pr[h(x) = 1] - Pr[h(x') = 1]| <= d(x, x'). So it's a kind of Lipschitz condition on the classifier itself: it should keep two individuals that are pretty close pretty close in the output, allowing the classifier to be randomized. In particular, a deterministic algorithm had better output the exact same value for both. So it's essentially impossible to satisfy this with a deterministic classifier: you have to split the set of inputs somewhere, and if you keep constraining neighbors to take the same value and walk across a connected neighborhood, you can never change the label. So you basically need randomness.

Then there's the notion of weakly meritocratic fairness, which touches another vexing question in the whole fairness literature: can we, in various situations, decouple fairness from merit? Is fairness just the same as saying the meritorious should be rewarded, or is it different? I don't know the answer exactly; it's a tough question. But this definition says yes, fairness should track merit: an individual should not be favored over a more creditworthy individual. Think of a bunch of people coming to the bank for a loan. Here x is the individual's feature vector and y is their true type: y = 1 if they will repay the loan, y = 0 if they won't; in general y could be a real number between 0 and 1, the probability of repaying. The condition says that if the algorithm classifies x as creditworthy with greater probability than x', then it had better be the case that x is in fact a better repayer of loans than x'. It should only favor somebody as a better risk if they truly are a better risk. Again, y can be binary or continuous; the definition works in both cases. (Yes, this is stated for each individual, not in expectation over a group, in this particular definition.)

There are two critiques of these definitions. First: where does this metric come from? How do we know which individuals are similar? That's a big issue, and the paper that introduced the idea essentially punted on that question: somehow, magically, we are given this metric, and then we can come up with a classifier satisfying the condition. Understanding this metric may have to be studied in a context-sensitive manner for each scenario; there is no general notion. Second, in the meritocratic case, this kind of thing is possible only under some kind of realizability assumption, which I won't go into because I'm not going to talk about that result at all. A minimal check of the metric-fairness condition is sketched below.
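As a concrete reading of the Lipschitz condition, here is a minimal audit sketch (the metric d and the randomized classifier are made-up placeholders, since the talk notes the metric itself is left unspecified): it checks |p(x) - p(x')| <= d(x, x') over all sampled pairs.

```python
import itertools
import numpy as np

def d(x, xp):
    # Placeholder "task-relevant" similarity metric -- the hard, unresolved
    # part of the definition. Here: scaled Euclidean distance.
    return 0.5 * float(np.linalg.norm(x - xp))

def p_positive(x):
    # Placeholder randomized classifier: probability of the positive label.
    return 1 / (1 + np.exp(-(x[0] + x[1] - 2)))

def lipschitz_violations(X):
    """Return all pairs where |p(x) - p(x')| <= d(x, x') fails."""
    bad = []
    for x, xp in itertools.combinations(X, 2):
        if abs(p_positive(x) - p_positive(xp)) > d(x, xp):
            bad.append((x, xp))
    return bad

rng = np.random.default_rng(2)
X = rng.normal(0, 1, (200, 2))
print(f"{len(lipschitz_violations(X))} violating pairs out of {200 * 199 // 2}")
# This particular sigmoid has Lipschitz constant below 0.5, so it passes;
# a deterministic 0/1 classifier would fail on pairs straddling its boundary.
```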
As for the question about dissimilar individuals, whether there's some subspace where they become similar: what Cynthia's answer would be, or maybe could be, is that if you had the right metric, those orange points sitting lower down would come up; in a different metric space the pluses would be close together and the minuses would be close together. But what the right metric is, or whether it even exists, is a question that's not answered. You're right, you're basically looking for some subspace where they come together, the one that's relevant, and it's not very clear at all what that means.

Okay, so then statistical fairness notions. The basic idea is that you partition the world into groups G1, G2, up to Gk of individuals, and you want equality of some statistical measure across the groups. In the standard fairness literature these are the protected groups, by gender, race, sexual orientation, ethnicity, the usual categories protected by law. There's also some very exciting work that is more complexity-theoretic, for example by Reingold and Rothblum and others, which says that every large enough group that can be computationally identified should be protected: if there's an algorithm, maybe an efficient one, that determines whether someone is a member of the group, then you need to get good statistics on all such groups. That's a nice idea that could be explored further, and there are definitions like that, but I'm not going to talk about those.

The most basic definition is statistical parity: the fraction of people in group i who are classified as positive by the algorithm should equal the fraction of people in group j classified as positive, for all groups i and j. This is without regard to the ground truth about how many individuals in each group are actually positive, and it can be critiqued on a number of grounds. It says, for example, that the fraction of people from group one who get a bank loan should equal the fraction from group two, regardless of how many people in each group are creditworthy. That has its problems.

Then there's equalized odds: equalized false positive and false negative rates. The slide shows the false-positive-rate version: conditioned on an individual truly being a negative for this classification, the probability of being classified positive should be equal for both groups. If positive is an undesirable class, for example being pulled aside at the airport security line for an enhanced pat-down, then the fraction of people from each group who are not threats but are pulled aside should be equal. Stop-and-frisk is another good example. Nothing is said here about the people who actually are threats, the true positives. So that's equalized odds.
Calibration kind of reverses the conditioning: conditioned on the algorithm having labeled a person a one, meaning a threat or whatever the positive class is, what is the probability that they truly are a one? That's calibration; positive predictive value is a related notion. And that probability should be equal across the two groups.

A famous argument happened over a program called COMPAS, which was being used to predict the recidivism risk of criminals, to help decide whether they would be released. A public-interest journalism organization, ProPublica, published a critical, very well-researched study showing that COMPAS was biased: if recidivism is the positive class, it had a higher false positive rate for African-Americans and a higher false negative rate for Caucasians, so it was biased against African-Americans quite strongly. COMPAS's response to ProPublica was to say: but it is calibrated, and they had the data to show it was calibrated.

(In response to a question about ground truth.) No, that's only true of the first definition, statistical parity. The others assume there is a ground truth about which individuals in a group are positive examples and which are negative, and you had better get that ground truth more or less right in both groups. Why is y a given? As in any machine learning, y comes from labeled examples: by past history we have labeled data. We let some people go and saw whether they committed crimes or not, given their feature vectors; we gave some people loans and saw what happened. History is our source of y's: there's a training data set with labeled examples, and the test data comes in the future. And yes, fair enough: your labeled examples may themselves be biased, or you may only have partial feedback in the way we described; for things like crime you typically only see the label if the person was arrested and released. As for the middle criterion, equalized odds is really just talking about fitting, finding a hypothesis consistent with the observations you have so far; that's true, though you might also question the observations themselves, I agree. So ProPublica and COMPAS had this great argument: ProPublica pointed out the flaw in COMPAS according to the second criterion, while COMPAS defended itself by the third. And yes, positive predictive value says that if you actually give the loan, or do whatever it is that lets you observe the person, then they do what they're supposed to do with the right probability; you can only check it where you see the outcome, that's right. So again, you're using historical data gathered before this algorithm was put in place, and such labeled pairs are available; that's the model. Or, even if the system is being used in some places and not others, we are continuously gathering labeled data that way. A minimal computation of these three group metrics is sketched below.
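To pin the three definitions down, here is a minimal sketch with made-up arrays (the groups, base rates, and predictor are placeholders of mine): statistical parity compares positive rates, equalized odds compares error rates conditioned on the truth, and calibration / positive predictive value conditions the other way around.

```python
import numpy as np

def group_metrics(group, y_true, y_pred):
    """Per-group positive rate, false positive rate, and PPV."""
    out = {}
    for g in np.unique(group):
        m = group == g
        pos_rate = y_pred[m].mean()             # statistical parity
        fpr = y_pred[m & (y_true == 0)].mean()  # equalized odds (FPR side)
        ppv = y_true[m & (y_pred == 1)].mean()  # calibration / PPV
        out[g] = dict(pos_rate=pos_rate, fpr=fpr, ppv=ppv)
    return out

# Hypothetical audit data: two groups, true labels, model decisions.
rng = np.random.default_rng(3)
group = rng.integers(0, 2, 5000)
y_true = (rng.random(5000) < np.where(group == 0, 0.4, 0.2)).astype(int)
y_pred = (rng.random(5000) < 0.3 + 0.4 * y_true).astype(int)

for g, m in group_metrics(group, y_true, y_pred).items():
    print(g, {k: round(v, 3) for k, v in m.items()})
# The impossibility results discussed next say that with different base
# rates (0.4 vs 0.2 here) and an imperfect predictor, you cannot equalize
# both fpr and ppv across the groups at once.
```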
Unfortunately, there's no way to resolve this dispute between COMPAS and ProPublica, because a couple of papers proved that it is impossible to achieve equalized odds and equal positive predictive values simultaneously, except in two extreme situations: either the base rates of the two populations are exactly the same, or you have a perfect classifier that makes no errors. Outside those ideal situations you cannot have both, so there's no resolution to this controversy. (That's an excellent question, and several of us are working on it: how can we relax this and get approximate notions of fairness for approximately good classifiers, and what's the relationship between the approximation ratios? There are no results on that yet.)

So that's the background on fairness. Let me briefly describe multi-armed bandits next; many of you may know this, but I'll assume not. The basic model is the stochastic bandits case, the simplest one. You have K arms, and each arm i has a reward distribution D_i whose mean is mu_i: if you pull that arm, you get a reward drawn from D_i. Think of these as casino slot-machine arms. At each time t = 1 through T, the decision maker pulls an arm i_t and gets a reward drawn from D_{i_t}. The goal of all these bandit algorithms is to minimize what's called the regret, or pseudo-regret: if you had known which arm has the best expectation and pulled it all the time, you would have had expected reward T times max_i mu_i; instead, in expectation your algorithm got the sum over t of mu_{i_t}. The difference is your regret, how much you left on the table, and the goal is to minimize it.

There are nice algorithms that minimize regret; let me go through one quickly. The decision maker maintains the sample means of all the arms pulled so far, and has to balance exploration against exploitation. Exploration means pulling arms that haven't been tried much; exploitation means pulling the arm that has produced the best sample mean so far. You have to find a balance, and the upper confidence bound (UCB) algorithm balances these two nicely, achieving order log T regret in this setting. What does UCB do? Along with the sample mean of each arm, it maintains a confidence interval whose width depends on how many times that arm has been pulled: the more times you pull it, the narrower the interval, because you become more and more sure the sample mean is close to the true mean. So it maintains an interval within which the true mean of that arm lies with very high probability, and then it simply pulls the arm with the highest upper confidence bound, the highest right end of its interval. That balance achieves order log T regret. A sketch of UCB follows.
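Here is a minimal sketch of UCB for the stochastic case (the arm means and the exploration constant are my own placeholder choices): each round, pull the arm whose sample mean plus confidence width is largest.

```python
import math
import numpy as np

rng = np.random.default_rng(4)
mu = np.array([0.3, 0.5, 0.7])  # true (unknown) arm means, Bernoulli rewards
K, T = len(mu), 20_000

pulls = np.zeros(K)             # times each arm was pulled
means = np.zeros(K)             # running sample means

for t in range(1, T + 1):
    if t <= K:                  # pull every arm once to initialize
        arm = t - 1
    else:
        # upper confidence bound: sample mean + sqrt(2 ln t / n_i)
        ucb = means + np.sqrt(2 * math.log(t) / pulls)
        arm = int(np.argmax(ucb))
    reward = float(rng.random() < mu[arm])
    pulls[arm] += 1
    means[arm] += (reward - means[arm]) / pulls[arm]  # incremental mean update

regret = T * mu.max() - pulls @ mu
print(f"pulls per arm: {pulls}, expected regret: {regret:.1f}")
# Suboptimal arms are pulled only O(log T) times each, so regret is O(log T).
```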
Why do bandit problems arise in fairness? You can imagine the K arms as K different groups of people, each with its own distribution; say, in a bank-loan situation, each group has its own distribution of ability to repay, and you want to pull the arm corresponding to the group with the greatest ability to repay. Don't think of a group here as one whole demographic group; that would sound like giving loans only to one group and never the others. A group can be parameterized by the feature vectors of the persons involved, and to make that clear there's an extension of the stochastic problem to the linear bandit problem. You have K arms with unknown parameter vectors beta_1 through beta_K, and rounds t = 1 through T as usual. In each round, K contexts arrive, x_{1,t} through x_{K,t}. A context might be a vector of two values, SAT score and number of credit cards, for example, and the arms might correspond to different groups: the green group and the orange group might be different arms. Each arm weights the context features differently and gives you a reward whose expectation is the inner product of the context vector with the arm's parameter vector; the actual reward is drawn from a distribution around that expectation. The algorithm only observes the reward for the arm it chooses, and again we measure performance by regret: the best you could have done for each context that arrived at each time, minus what you actually got. So that's multi-armed bandits, briefly; we'll get back to it in the particular problems. A minimal sketch of the linear reward model is below.
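For concreteness, a minimal sketch of the linear bandit reward model just described (the parameter vectors, contexts, and the deliberately naive random policy are my own placeholders): the learner picks an arm given this round's contexts, sees only that arm's reward, and regret compares against the best arm per round.

```python
import numpy as np

rng = np.random.default_rng(5)
K, d, T = 3, 2, 1000
beta = rng.normal(0, 1, (K, d))        # unknown per-arm parameter vectors

total_reward, regret = 0.0, 0.0
for t in range(T):
    X = rng.normal(0, 1, (K, d))       # context x_{i,t} arriving for each arm i
    expected = (X * beta).sum(axis=1)  # E[reward_i] = <x_{i,t}, beta_i>
    arm = int(rng.integers(K))         # placeholder policy: uniformly random
    total_reward += expected[arm] + rng.normal(0, 0.1)  # noisy observed reward
    regret += expected.max() - expected[arm]            # shortfall vs best arm

print(f"reward {total_reward:.1f}, regret of random policy: {regret:.1f}")
# A real learner (LinUCB-style) would estimate each beta_i from the rewards
# it observes and drive this regret down to sublinear in T.
```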
So let me get on to the three results, starting with unfairness through free-riding. It's a slight variation on the multi-armed bandit setting. What's the situation? The learning is happening across many agents: it's not one individual deciding which arm to pull, but several people figuring out what the best arms are. Think about pharmaceutical research, where companies are trying to figure out which drugs work, and using another company's research, the exploration that another company has already done, could save you a lot of money. That's the kind of shared-exploration situation we're talking about, though since pharmaceutical companies protect their IP very seriously, it's probably not realistic. The question we ask is: by free-riding on other people's exploration, can you lower your regret? Basically, can you let other people suffer the regret and get a freebie? A more realistic, if less important, example is Yelp evaluations of restaurants: you're exploring different restaurants in the city and you want to find the best one.

So, a slight difference from the standard bandit setting: each diner is a player, and pulling an arm corresponds to choosing a restaurant; each restaurant is an arm. At each round each diner pulls an arm, and the information generally available about diners is their context (I'll get to contexts in a minute), how many times they've pulled each arm, and what rewards they got. Collective exploration spreads the regret around. Take the very simple stochastic case: each restaurant has an expected reward, how good the dining experience there is; you have n diners and K arms, and some of the players are public about what they're finding, so we get data about each public player's actions and rewards. For simplicity in the talk, assume one public player, plus a free-riding private player who doesn't share his data. Suppose the free-riding private player plays the greedy strategy: at all times, play the arm that has been played most often by the public player. The question is whether the free-riding private player is then guaranteed little-o of log T regret. This is true, but it's not enough that the public player merely has low expected regret; we need the stronger condition that the probability that the public player's regret is linear drops polynomially, like t^(-w), where w needs to be at least 2. An alpha-UCB public player does achieve such a tail bound, and then the free rider can get constant regret by always pulling the arm pulled most often by the alpha-UCB player. That's what we can prove in the classic case where there are no contexts. (That's an excellent question: in this case, by the nature of the result, you only need to see the number of arm pulls, not the rewards; but that won't be true in general once we generalize.) The intuition: say there are just the two of you, and the other player is playing UCB; then you know that player has gone to the best restaurant most often, and only a few times to worse restaurants, so you just go to the restaurant with the most visits. It doesn't matter what the average review is; you pick the one with the most visits. A sketch of this greedy copying strategy is below.
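Here is a minimal simulation of the no-context case (the arm means and UCB constant are my placeholders, and plain UCB stands in for the alpha-UCB variant): a public UCB player explores, while the free rider just copies the arm the public player has pulled most often.

```python
import math
import numpy as np

rng = np.random.default_rng(6)
mu = np.array([0.3, 0.5, 0.7])
K, T = len(mu), 20_000

pub_pulls, pub_means = np.zeros(K), np.zeros(K)
free_pulls = np.zeros(K)

for t in range(1, T + 1):
    # Public player: standard UCB (alpha-UCB would scale the width term).
    if t <= K:
        pub_arm = t - 1
    else:
        width = np.sqrt(2 * math.log(t) / pub_pulls)
        pub_arm = int(np.argmax(pub_means + width))
    r = float(rng.random() < mu[pub_arm])
    pub_pulls[pub_arm] += 1
    pub_means[pub_arm] += (r - pub_means[pub_arm]) / pub_pulls[pub_arm]

    # Free rider: copy the arm the public player has pulled most often.
    # Note it needs only the pull counts, not the public player's rewards.
    free_arm = int(np.argmax(pub_pulls))
    free_pulls[free_arm] += 1

print("public regret:    ", T * mu.max() - pub_pulls @ mu)
print("free-rider regret:", T * mu.max() - free_pulls @ mu)
# The free rider's regret stops growing once the best arm dominates the
# public player's counts, matching the constant-regret claim.
```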
The more interesting case is when each diner, each arm puller, has a context vector. What does the context vector mean? It might encode their preference for vegan food, or a quiet restaurant, or local produce; imagine all the features you could care about in restaurants. The free rider has context vector x_1, say, the public UCB players have contexts x_2 through x_10, and each restaurant has a parameter vector describing its quality along each of those dimensions: how vegan-friendly it is, how quiet it is, and so on. The expected reward when you go to a restaurant is the inner product of your context vector with the restaurant's parameter vector. The main result we show is that if the free rider's context vector is a linear combination of the other diners' context vectors, taking the combination whose coefficient vector c has the smallest norm (that norm appears in the denominator of the bound, so a large norm is a problem), then the free rider can achieve constant regret by using the other diners' observations. Roughly speaking, the free rider has to lie in the cone of the other diners' context vectors.

To prove this, what we needed to show is that the other diners, even though they have a favorite restaurant and are optimizing their own regret, will still visit the other restaurants reasonably often. We need a lower bound on the number of times any diner visits each restaurant other than their optimum, because that lower bound is what lets the free rider gather enough data about the restaurant that may turn out to be his best. The proof of this lower bound is nice. Take one of the other diners; we want to show that in the stochastic bandit setting they visit every restaurant at least so often, even restaurants that aren't best for them. Why is it stochastic? Because once you fix that diner's context, all the rewards become a stochastic bandit: you can take the inner product of each parameter vector with the context vector beforehand, so each diner is just solving a stochastic bandit problem, and we have to argue that in such a setting the diner visits all the restaurants. It's an inductive proof showing there is a time T_K beyond which, for t greater than T_K, the number of times that diner has visited every arm by time t - 1 is at least on the order of log(t) / K^2: logarithmic in the time horizon, with a K-squared in the denominator, so the bound degrades somewhat as the number of arms grows. From this lower bound we can argue that the free rider has enough samples of every restaurant, via all the other players, to judge his own value for each restaurant. In this setting we need both the number of arm pulls by the other players and their rewards; we cannot do it without the rewards, and we need to know their contexts as well.

On to the second result: pipelined decisions, where fairness becomes rather impossible; pretty much nothing can be done, especially if you insist on exactness. Again, approximation might be the way out. Here is an example of a pipelined decision: people take the SAT exams, some of them get into a university, all of those who get in graduate (let's keep it simple), they all apply for jobs, and some of them are hired. So it's two steps in a pipeline: entrance to a university, and getting a job.
That's where we are, and we have a super simple model. This model is completely unrealistic and you can shoot many holes in it, but the point is that we're proving a negative result, and for that, simplicity is not by itself a bad thing. The scenario is the following. We have two populations, P1 and P2, and every individual has something called their type, a scalar; it's their only measure of quality, and somehow this magical number exists for every individual. Again, highly simplified. Each population has its own distribution of types, D1 for population 1 and D2 for population 2. Students take a test, call it the SAT for simplicity, and the big assumption is that the SAT is a noisy but unbiased signal of type: for an individual of type t, it outputs t plus noise that is, say, Gaussian. The college uses some monotone admission rule based on the SAT score; there's no other data, just that one number. The rule may be probabilistic or deterministic: a deterministic monotone admission rule just says anybody with a score above some threshold is admitted, while a probabilistic one says that a given score yields a given probability of admission, monotone increasing in the score. The rules may differ between the two populations; we allow that. Admitted students get a grade, the GPA, which is another noisy unbiased signal of type, and conditioned on type it is independent of the SAT score. And the big thing about the model is that the college adds no value: the input type is the same as the output type; college just provides another noisy signal of your type. There are economics papers that make that assumption, and we know economists are always trustworthy, right? Then we assume a rational, purely utility-maximizing employer who hires a graduate if there's a positive payoff from hiring them; I'll explain what that means in a minute. We explicitly assume the employer is not worried about diversity or other criteria, which many employers are; we keep the model simple.

The limitations of the model: one big limitation is that there's no correcting for past discrimination; we assume the input types of the students are true representations of their quality, and that's all we reason about. That's a huge limitation. Tests and grades are unbiased: also a huge problem, an assumption that has been debunked a few times for the SAT in particular, but we'll still make it. Colleges add no value, as I said, and employers care only about utility, not diversity and so on. All of these are limiting assumptions, but we're going to show that even in this setting things are difficult. It's actually possible that when you make the model more complicated, things become possible; I'm not saying the negative results would hold in a more realistic model, it's just harder to analyze. A minimal simulation of the model is sketched below.
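Here is a minimal simulation of the pipeline model (all distributions, thresholds, and noise levels are placeholder choices of mine): type is latent, SAT and GPA are unbiased noisy readings of it, and a deterministic monotone rule thresholds the SAT.

```python
import numpy as np

rng = np.random.default_rng(7)

def pipeline(n, type_mean, sat_threshold, sat_noise=1.0, gpa_noise=1.0):
    """Run one population through the pipeline; return admitted types and GPAs."""
    t = rng.normal(type_mean, 1.0, n)        # latent type, distribution D_i
    sat = t + rng.normal(0, sat_noise, n)    # noisy unbiased signal of type
    admitted = sat > sat_threshold           # deterministic monotone rule
    gpa = t[admitted] + rng.normal(0, gpa_noise, admitted.sum())
    return t[admitted], gpa

# Two populations with different type distributions, same admission threshold.
t1, g1 = pipeline(100_000, type_mean=0.5, sat_threshold=1.0)
t2, g2 = pipeline(100_000, type_mean=0.0, sat_threshold=1.0)

# A rational employer's view: among admitted students with GPA near 1,
# what is the expected type? It differs by group, even at the same GPA.
e1 = t1[(g1 > 0.9) & (g1 < 1.1)].mean()
e2 = t2[(g2 > 0.9) & (g2 < 1.1)].mean()
print(f"E[type | admitted, GPA ~ 1]: pop1 = {e1:.2f}, pop2 = {e2:.2f}")
# The posterior expectation depends on the group's prior, which is why a
# rational employer cannot simply ignore group membership.
```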
So what is the goal of affirmative action in this two-step pipeline? We define two seemingly reasonable goals. The first is equal opportunity, but end to end, meaning not just getting into college but actually getting a job out of college: an individual of type t should have the same probability of being hired regardless of which population they come from. Two individuals of the same quality should have the same probability of being admitted and then hired. And let's assume there's no other way of getting a job except going through college; you cannot route around college and get the job.

The second is irrelevance of group membership. First let me say what the rational employer does, since I haven't yet. The rational employer, based on observing what an individual has done (got into college, got a certain grade), forms a posterior distribution over the individual's type. Its behavior is: look at the expectation of this posterior; if it's higher than a certain threshold, the zero-payoff break-even point, hire the individual; if it's lower, don't. So irrelevance of group membership says that an employer with a desired type threshold should be able to ignore which group the individual comes from: the employer knows the group, but it doesn't change the decision. Then there's strong irrelevance of group membership: after college you may go to multiple employers, all with different thresholds, so you might want group membership to be irrelevant for all employers. One way to ensure that is for the type distribution of students admitted to college to be the same regardless of group, the same distribution for admits from group 1 as from group 2. That's called strong irrelevance of group membership.

So, the employer's behavior again: form the posterior distribution of the student's type based on her group, admission to college, and grade; have a threshold C for the desired type of employee; if the expectation of the posterior is at least C, hire, otherwise don't. For group-blind hiring, you have to set the admissions and grading policies so that this test is group-independent. What is a grading policy? The grade is an unbiased estimator of type, so its expected value is the individual's type; the only thing the college can control is the variance of the grade. The employer's decision rule is written out below.
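Written out (a sketch in my own notation, assembled from the talk's definitions, with beta_i denoting population i's admission threshold):

```latex
% Employer's rule: hire an admitted student from population i with grade g iff
\[
\text{hire}(i, g) \iff
\mathbb{E}\bigl[\, t \;\big|\; \text{pop } i,\; \mathrm{SAT} \ge \beta_i,\; \mathrm{GPA} = g \,\bigr] \;\ge\; C .
\]
% Irrelevance of group membership for this one threshold C: the set of grades
% that clears the bar must be the same for both populations,
\[
\mathbb{E}\bigl[ t \mid \text{pop } 1,\, \mathrm{SAT} \ge \beta_1,\, \mathrm{GPA} = g \bigr] \ge C
\;\iff\;
\mathbb{E}\bigl[ t \mid \text{pop } 2,\, \mathrm{SAT} \ge \beta_2,\, \mathrm{GPA} = g \bigr] \ge C
\qquad \text{for all } g .
\]
```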
Schools may not know what the employer's threshold is, and multiple employers may have multiple thresholds, so schools have to optimize for all of them; we can then ask that independence of group membership hold for a range of thresholds. What we can show is that once you start requiring independence for multiple thresholds, you essentially have to have strong independence of group membership: the posterior type distribution of the admitted students must be the same for both groups.

So, basically, the positive results are very special cases, and the negative results cover the realistic cases. Positive: if the SAT could be noise-free, truly revealing the type of the student exactly, then we can have all the fairness goals we want; you'll see why in a minute. Also, if colleges don't report grades at all, which some high-end business schools actually do (only the top business schools can get away with it), we can achieve both independence of group membership and equal opportunity by setting a very high admission threshold; I'll go through that in a minute. And we can achieve the limited goal of independence of group membership for one employer threshold even with grades and a noisy SAT. The negative result: in the more realistic case of a noisy SAT and grades, no monotone rule can achieve strong independence of group membership, and equal opportunity is possible only by denying everybody. Which is, technically, an equal-opportunity solution.

Okay, the proof ideas; sorry, any questions? Suppose among the employers the maximum threshold on type is C+; C+ is the highest threshold any employer has. If the SAT is noiseless, truly showing your type, then schools admit everyone whose score, hence type, is greater than C+, and an employer hires everyone the school admits. This is equal opportunity: if your type is higher than C+, you'll score higher than C+, be admitted, and be hired, independent of the population you belong to. And you can check it's also independent of group membership. In the no-grades case you can likewise achieve both: you set the admission threshold so high that for every group, the posterior distribution of type, conditioned on the individual's score beating that threshold, has a mean better than C+. You admit only the best students, those with high types; every admitted student is hired, and both objectives are achieved.

Okay, deterministic monotone threshold rules; maybe I'll just get to the results and stop there, since I'm mindful of the time. The sketch of the ideas is this. Look at the posterior expectation: the expected type of an individual from population i, given that their score beat the threshold and their grade was some g. We show some nice properties of this posterior: it's continuous, differentiable, and increasing in all of these parameters, and as the grade goes from minus infinity to infinity, the type expectation goes from minus infinity to infinity as well. So it's a nicely behaved posterior expectation function.
You can then achieve independence of group membership for a particular threshold in the following manner. Suppose the threshold is C. You find thresholds beta_1* and beta_2* for admission to college such that there exists a single grade at which this posterior expectation, as far as the employer is concerned, becomes C for both groups. How do you do that? Fix the admission threshold for one group and move the threshold for the other; by continuity and differentiability and so on, there is a setting at which the two posteriors become equal at the same grade, and then you get independence of group membership. These are all elementary consequences. Strong independence of group membership is just not achievable, and that follows from arguments similar to the Kleinberg, Mullainathan, and Raghavan result.

The third result: let me just say what it is without actually going into it; perhaps Jamie explained this result in her job talk, since it's very much joint work with her. Let me stop using the slides and just talk about it briefly. There are situations where exploration seems unfair to the individual at hand. One example is medical trials: you have a bunch of different drugs you could give to a patient, represented by the arms, each with its own parameter vector, and we don't know the efficacy of each medication; we learn these parameter vectors as individuals pass through the system. A new individual comes along, represented by a feature vector. One possibility is to run a UCB-type algorithm, sometimes giving them a treatment that is suboptimal in expectation given our current knowledge, just in the interests of science, so that we can learn the values of the treatments better. But that doesn't seem fair to the individual in front of us. So you might want to just give them the best treatment in expectation: no exploration, only exploiting what we know. The danger, of course, is that this greedy approach can incur a lot of regret; that's well known in the bandit literature. Another example is a myopic agent: an Airbnb landlord might not want to explore groups they don't know much about; they just want to rent to the people they have good data on, and so they might discriminate against other groups. They're not exploring, for a different reason. So there are two different reasons why exploration doesn't happen, and a greedy algorithm comes into play in both situations, and we ask: is that always bad? In general it seems bad, but what we show in this result is that it isn't, provided there is sufficient diversity in the data, in the following sense: an adversary chooses the feature vectors of the people coming through the system, but each vector receives a random Gaussian perturbation, and if the perturbation has sufficient variance, then greedy achieves essentially optimal regret. So in that case it's okay to be greedy and still get the best outcomes. That's the result, which I'm not going to explain further, but a minimal sketch of the greedy algorithm in this smoothed setting is below.
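Here is a minimal sketch of the greedy contextual algorithm in the smoothed setting (the dimensions, perturbation variance, and ridge-style estimator are my own placeholder choices): contexts are an adversarial base plus Gaussian noise, and the learner always pulls the arm its current estimates say is best, with no exploration bonus.

```python
import numpy as np

rng = np.random.default_rng(8)
K, d, T, sigma = 3, 2, 5000, 0.5         # sigma: perturbation std deviation

beta = rng.normal(0, 1, (K, d))           # true (unknown) arm parameters
A = [np.eye(d) for _ in range(K)]         # per-arm regularized Gram matrices
b = [np.zeros(d) for _ in range(K)]
regret = 0.0

for t in range(T):
    base = np.ones((K, d))                # "adversarial" base contexts (fixed here)
    X = base + sigma * rng.normal(0, 1, (K, d))   # smoothed contexts
    theta = [np.linalg.solve(A[i], b[i]) for i in range(K)]
    est = np.array([X[i] @ theta[i] for i in range(K)])
    arm = int(np.argmax(est))             # pure exploitation: no confidence bonus
    reward = X[arm] @ beta[arm] + rng.normal(0, 0.1)
    A[arm] += np.outer(X[arm], X[arm])    # ridge least-squares update
    b[arm] += reward * X[arm]
    regret += (X * beta).sum(axis=1).max() - X[arm] @ beta[arm]

print(f"greedy regret over {T} smoothed rounds: {regret:.1f}")
# With sigma = 0 a greedy learner can lock onto a bad arm forever; the
# perturbation supplies the diversity that makes greedy safe, matching
# the spirit of the result described above.
```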
Let me see if I have any conclusions. I don't think I have conclusions; this is where I was going to get more technical, but let me just conclude by saying that fairness is a very loosely defined space with lots of possible definitions. It would be nice at least to consolidate some of these definitions and understand the smaller set of definitions we need to work with, to understand fairness more fully. We are not at that point yet; we're just throwing out definitions left and right. Maybe eventually there will be some consolidation and better understanding; these are early days. I'll stop here.

On the question of being adversarial about the distributions of your types or data, trying to be as fair as possible compared to, say, the future: right, adversarial settings are another direction; the regret bounds are worse, but that's fine. Unfortunately, in the free-riding setting that's probably not possible, because the geometry of the problem is such that the free rider's context vector points in a certain direction and the adversary can exploit that. My instinct is that it's not possible, but I don't have a definite answer. And in the other bandit setting, the greedy one, with purely adversarial contexts you are going to be sunk: you need the noise in order to say that you don't have to explore. So maybe in the first one there's some hope, but I don't think so.

On whether fairness could apply to something that's not human: that's a good question, and I think the answer is no; I'm going out on a limb and saying the only entities we currently care to be fair to are humans, though animals, maybe. Gerrymandering is an example of a fairness question where you're drawing congressional districts: are we being fair to the people, ultimately? It sounds like we want fair districting, but there's probably some underlying human in all of these definitions that we're trying to be fair to.

On using the algorithm's output to define similarity: no, I think that's not the idea, because if you use the output as the definition of similarity, then by construction the algorithm is fair in that sense; fairness is built in. The idea, and again this is a question the paper punts on, is to define some notion of similarity that seems correct for the classification task at hand. Maybe for bank loans your athletic ability should not play a role, for example. You have to find the right data and the right notion of similarity, and then you can require that similar individuals are treated similarly.

And on whether there's an axiomatic treatment, an objective you maximize or minimize subject to axioms: no, there is no such thing yet; we are not at the point of a single axiomatic system, or any mainstream axiomatic treatment of fairness. That's exactly right. Okay, thank you.