Okay, so I'm going to talk again about unfairness in algorithmic decision-making. It's going to be a bit of a breadth talk rather than a depth talk: it's three results, and I'm trying to squeeze all of them in, mainly because I don't know how to present a series of ever more complicated derivations of conditional probabilities in a talk and make it interesting. So I decided to keep it to the intuitive picture and just give you a survey of some results. These are three different results, joint work with a number of people, most notably Jamie Morgenstern, who is not here. Let's see.
So there have been lots of headlines of this kind lately: algorithmic decision-making holds a lot of promise in situations that are critical to individuals, such as hiring, where maybe the hope is that algorithms, being objective, will avoid the biases that humans bring to the task; lending by banks; and, even more seriously, predicting criminal behavior, recidivism and so on, based on data about individuals. Some of these articles are written in a very hyperbolic fashion about how the future is going to be fair and objective. Sentencing is being done in this manner, deciding how long sentences should be, and predictive policing is another application of these kinds of techniques.
So on the one hand, these hyped-up headlines say the world is going to be wonderful from now on because algorithms are going to take over. But then of course there is also the natural fear that big bad data may be triggering discrimination: using algorithms trained on historical data, which reflects all the biases of society, just propagates those biases further, maybe even propagates the same biases at a larger scale. The other problem, I suppose, is that these algorithms are black boxes that are very hard to understand: we don't know what methodologies they are using, it's very hard to interpret the results they give, or to explain why somebody was treated some way by an algorithm. So these usual problems have also been pointed out. On the one hand there is hype about all this, and on the other hand a lot of hand-wringing about how algorithms are going to destroy us, so we need to do something. As computer scientists, I guess our goal is to strike some middle ground, where we want to understand, more objectively and mathematically, what these algorithms do and do not do. So this is a talk, like much of this area, aimed at setting up some basic definitions and proving some very basic facts, which are not directly applicable but lay the groundwork for further work in this field.
So what is the talk outline? I'll briefly tell you about decision-making and fairness issues in general, give you background for this talk on one specific formulation, multi-armed bandits, which will be relevant to two of the three results I present, because it's one notion of how to make decisions in an online fashion when there's uncertainty (maybe some of you know it already), and then go through the three results pretty quickly.
Okay, so: decision-making background and definitions. The first question to ask is what kinds of decisions we want to be fair about. Right at the top, if we're talking about algorithmic machine-learning decisions, we should distinguish online decision-making from batch decision-making. Online means that the next decision is affected by the previous data you've seen in the stream of data arriving over time. In batch decision-making, you've already designed your algorithm based on some data you've learned from, and now you're using that same algorithm repeatedly to decide for a whole bunch of individuals. So, online versus batch decisions; it's in the online setting that the multi-armed bandits formalism is useful. In either of those settings, we could be talking about the classification problem, which is probably the most important thing we want to be fair about. All the examples I showed you are examples of classification, where the goal is to determine if a person gets a loan or doesn't, gets into college or doesn't. These are all binary classification problems: yes or no. Slightly more complicated is an allocation problem, where you might have a whole bunch of medical treatments available and a whole bunch of patients, and you have to decide which patients to allocate to which treatment. Another example of allocation: a conference has a certain number of slots for talks, and it has to take a whole bunch of submitted papers and decide which ones to accept. The difference between allocation and classification is just that there's an overall budget on how many things you can classify as positive, as opposed to pure classification, where the individual decisions are independent. Which congressional district I am assigned to is an allocation question too. And then one of the results I'll talk about concerns something like pipelined classification, where the idea is not to consider one particular point where a decision is made and ask whether it's fair or not; rather, an individual goes through a series of such classifications over their life: whether they get into college or not, and, depending on whether they got into college, whether they get a job or not. So imagine a pipeline of classifiers and ask what it means for that to be fair.
Yeah, I'm getting to that in a minute; that's going to be a big topic. So, defining fairness is a hugely difficult task. Some of you may be familiar with the notion of privacy, maybe with the definitions of privacy and differential privacy in particular, which has become the de facto standard for privacy because it's a clean definition that seems to fit many circumstances. The picture for fairness is much more messy and complicated: not only is there not one definition, there cannot be one definition of fairness that fits all situations.
Okay, so why might machine learning be unfair? Let me talk about that first, before talking about the definitions of fairness. Data might encode existing biases and in fact perpetuate them even more. For example, suppose you use, in your feature vector of observations about an individual, the number of times they were arrested rather than the number of times they were convicted. This is a self-perpetuating thing: the people who have been arrested more will be arrested again, and maybe they are never convicted in all the times they're arrested, but it's just bad data to be using to decide whether to arrest someone or not. So data might encode existing biases like that. And of course there's the usual problem of one-directional feedback: only if you give a person a bank loan can you see whether that person repays it or not; you don't see it for the people you refuse the loan to. So you only get data for the people you're treating positively. Then, different populations may have different properties, and I'll say more about this on the next slide. We also have less data about minority populations: our models are not well trained on a minority population because, by definition, there are fewer of them. And because of that, the burden of exploration may fall disproportionately on different populations.
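Since the one-directional feedback point is easy to miss, here is a minimal simulation of it. This is a sketch with invented numbers, not something from the talk: repayment is only observed for approved applicants, so the repayment rate the bank records is computed on a filtered, non-representative sample.

```python
import random

def repayment_rates(approve_threshold, n=20_000, seed=0):
    """Toy model of one-directional feedback: creditworthiness c ~ U(0,1),
    repayment happens with probability c, and the bank approves when a
    noisy score exceeds a threshold.  We only observe repayment for
    approved loans, so the observed rate is biased upward."""
    rng = random.Random(seed)
    all_outcomes, observed = [], []
    for _ in range(n):
        c = rng.random()                      # true creditworthiness
        repays = rng.random() < c             # ground-truth outcome
        score = c + rng.gauss(0, 0.1)         # noisy score the bank sees
        all_outcomes.append(repays)
        if score >= approve_threshold:        # denied applicants yield no data
            observed.append(repays)
    true_rate = sum(all_outcomes) / len(all_outcomes)
    observed_rate = sum(observed) / len(observed)
    return true_rate, observed_rate
```

With a real approval threshold the observed rate substantially overestimates the population rate; approving everyone (a threshold below any possible score) makes the two coincide.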
So for example, here are two populations, and we're deciding whom to admit to college based on SAT score and number of credit cards. The number of credit cards sounds kind of crazy, but it may be a measure of wealth, and people who are wealthier may employ SAT tutors, so the scores of people with more credit cards might be higher for that reason. So you have a green population and an orange population, and suppose you want to find a linear classifier for each one. Pictorially, the best classifier for the green population looks like this, and notice that if we use that classifier for the overall population, it denies all the qualified people from the orange population; it just says no to everybody who's qualified from the orange population, because the correct classifier for the orange population is lower down. The pluses are positive examples, people who would do well in college; this is labeled data you're given, and the labels say that pluses are the people who would succeed in college and minuses the people who wouldn't. So the best linear classifier for the green population is one line and the best one for the orange population is another, and if you decide to have a single classifier for the whole population, then just because of the numbers, there being so many more greens than oranges, you would go with the green classifier and be completely unfair to the entire orange population. The upshot is that we actually need to take into account the group membership of each person, orange or green, before deciding which classifier to use; being blind to group membership is not a solution for fairness.
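The picture on the slide can be reproduced in a toy one-dimensional version (all numbers here are invented for illustration): each group is internally easy to classify with a threshold, but the minority group's scores are shifted down, so the single threshold that maximizes accuracy on the pooled data is the majority group's threshold, and it rejects every qualified minority applicant.

```python
def best_threshold(scores, labels):
    """Return the cutoff t maximizing accuracy of 'score >= t => positive'."""
    def accuracy(t):
        return sum((s >= t) == y for s, y in zip(scores, labels)) / len(labels)
    return max(sorted(set(scores)), key=accuracy)

# Majority ("green") group, 900 people: qualified iff raw score index >= 5.
green = [(s / 10, s >= 5) for s in range(10)] * 90
# Minority ("orange") group, 100 people: same structure, scores shifted down.
orange = [(s / 10 - 0.5, s >= 5) for s in range(10)] * 10

pooled = green + orange
t_pool = best_threshold([s for s, _ in pooled], [y for _, y in pooled])
t_orange = best_threshold([s for s, _ in orange], [y for _, y in orange])

# The pooled threshold is the green group's threshold (0.5); the orange
# group alone is perfectly classified by 0.0.  Under the pooled threshold,
# every qualified orange applicant is rejected:
rejected = sum(s < t_pool for s, y in orange if y)
```

The per-group thresholds classify both groups perfectly, which is the "take group membership into account" point from the slide.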
Okay, so the question is: can we design fair ML algorithms? And now we ask what fairness means. This is very coarse and broad, but the literature divides between individual notions of fairness and group notions of fairness. Let's talk about each. Individual notions of fairness apply, for example, to lending, college admissions, hiring, all these things. Metric fairness, the notion due to Cynthia Dwork and others, tries to embody in a mathematical definition the principle laid down by the great philosopher John Rawls: similar individuals should be treated similarly. The definition: take two individuals x and x', and think of individuals hereafter as vectors of features; unfortunately we have to be reductive here, but we're going to think of each person as a feature vector. H(x) is the output of the machine-learning algorithm H on the individual x, and 1 is a positive classification. The requirement is that the difference between the probability that the algorithm outputs 1 on x and the probability that it outputs 1 on x' should be no more than the distance between x and x' in an appropriate metric space; the metric captures some notion of which individuals are similar. So it's a kind of Lipschitz condition on the classifier itself: it should map two individuals that are pretty close to outputs that are pretty close, allowing the output to be random. In particular, a deterministic algorithm would have to output the exact same value for both. In fact, it's impossible to satisfy this definition with a deterministic classifier, because you have to cut the set of inputs somewhere: if you keep constraining neighbors to have the same value and the neighborhood is connected, then everyone must get the same classification. So you need randomization, basically.
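As a concrete (hypothetical) rendering of the definition, treat h(x) as the probability that the randomized classifier outputs 1 on x, and check the Lipschitz condition pairwise. The second classifier below illustrates the point just made: a deterministic 0/1 rule must violate the condition at whatever boundary it cuts.

```python
def is_metric_fair(h, pairs, dist, tol=1e-9):
    """Dwork-style individual fairness check (a sketch): h(x) is the
    probability of a positive classification; require
    |h(x) - h(x')| <= dist(x, x') for every listed pair."""
    return all(abs(h(x) - h(xp)) <= dist(x, xp) + tol for x, xp in pairs)

d = lambda a, b: abs(a - b)                  # a toy similarity metric
smooth = lambda x: min(1.0, max(0.0, x))     # 1-Lipschitz randomized rule
hard = lambda x: 1.0 if x >= 0.5 else 0.0    # deterministic threshold rule
```

The smooth rule passes for any pair, while `hard` fails on near-boundary pairs such as (0.49, 0.5), where the outputs differ by 1 but the distance is 0.01.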
Then there's the notion of weakly meritocratic fairness, which touches another vexing question in the whole fairness literature: can we, in various situations, decouple fairness from merit? Is fairness just the same as saying the meritorious should be rewarded, or is it something different? I don't know the answer exactly; it's a tough question. But this definition says yes, fairness should be the same as merit. Stated a little informally: an individual should not be favored over more creditworthy individuals. Think of a situation where a bunch of people come to the bank for a loan. Let me write it down: x is the individual's feature vector, and y is the true type; y is 1 if they are going to repay the loan and y is 0 if they're not, and in general y could be a real number between 0 and 1, the probability of them repaying the loan. The definition says: if the algorithm classifies x as creditworthy with greater probability than x', then it had better be the case that x is a better repayer of loans than x'. It should only classify somebody as a better risk if they are in fact a better risk. Again, y could be binary or continuous, and the definition works in both cases: binary means they either will repay the loan or they won't; continuous means y is the probability they repay. This is all individual fairness. (Yes, for each individual, not a group, in this particular definition.)
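A literal sketch of this condition: p(x) is the probability the algorithm approves x, y(x) is the true repayment quality, and we check that no one is approved with higher probability than someone more creditworthy. The names and numbers below are illustrative, not from the talk.

```python
def is_weakly_meritocratic(p, y, individuals, tol=1e-9):
    """Weakly meritocratic condition (sketch): if x is approved with
    strictly greater probability than x', then x must be at least as good
    a repayer, i.e. p(x) > p(x') implies y(x) >= y(x')."""
    return all(not (p[a] > p[b] + tol and y[a] < y[b])
               for a in individuals for b in individuals)

y = {'x1': 0.9, 'x2': 0.4}         # true repayment probabilities
p_ok = {'x1': 0.8, 'x2': 0.3}      # favors the better repayer: allowed
p_bad = {'x1': 0.3, 'x2': 0.8}     # favors the worse repayer: violation
```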
So there are two critiques of these definitions. The first: where does this metric come from? How do we know which individuals are similar? That's a big issue, and you could say the paper that introduced this idea punted on that question: assume that somehow, magically, we got this metric, and now we can come up with a classifier that satisfies the condition. Understanding this metric may have to be done in a context-sensitive manner for each scenario; there is no general notion there. And in this case, this kind of thing is possible only under some kind of realizability assumption, which I don't want to go into because I'm not going to talk about this result at all. (Question: but for dissimilar individuals, isn't there some subspace that's critical?) What Cynthia's answer would be, or maybe could be, is that if you had the right metric, those orange points that are lower down would come up; in a different metric space, the pluses would be close together. But what the right metric is, or whether it even exists, is a question that's not answered. Yeah, you're right: I'm basically looking for some subspace where they come together, whichever one is relevant. Exactly, so it's not very clear at all what it means.
Okay, so then statistical fairness notions. The basic idea is that you partition the world into groups G1, G2, up to Gk of individuals, and we want equality across groups of some statistical measure. In the standard fairness literature these groups are protected groups: by gender, race, sexual orientation, ethnicity, whatever the usual categories protected by the Constitution are. Those might be the groups we would want to protect. There's also some very exciting work which is more complexity-theoretic, for example by Reingold, Rothblum and others, which says that every large enough group that can be computationally identified should be protected: if you consider any group of people such that there's an algorithm, maybe an efficient algorithm, that can determine whether someone is a member of the group or not, then you need to get good statistics on all of them. That's a nice idea that could be explored further, and there are definitions like that, but I'm not going to talk about those. Let me just state the most basic definition, statistical parity: the fraction of people in group i who are classified as positive by the algorithm should be equal to the fraction of people in group j who are classified as positive, for all groups i and j. This is without regard to the ground truth about how many individuals in each group are truly positive. So this is just statistical parity, and it can be critiqued on a number of grounds as well. For example, it says that the fraction of people from group one who get a bank loan should be equal to the fraction of people from group two who get a bank loan, regardless of how many people are creditworthy in the two groups. So that has its problems.
Then there's equalized odds, equalizing the false positive and false negative rates. This slide is just showing the false positive rate: conditioned on an individual truly being a zero, a negative for this classification, the probability of them being classified positive should be equal for both groups. Suppose positive is an undesirable class, as in some situations. For example: are you going to be pulled aside at the airport security line for an enhanced pat-down? Maybe that would be a positive classification. Then the fraction of people from one group who are not threats but are pulled aside should be equal to the corresponding fraction from the other group; stop-and-frisk is another good example. Nothing is said here about the people who actually are positive, meaning the people who are threats. Okay, so that's equalized odds. And then equal calibration kind of reverses the conditioning: it says, conditioned on the algorithm having labeled a person a one, meaning a threat or whatever, what is the probability they truly are a one? This is calibration, or positive predictive value, a related notion, and it should be equal across the two groups.
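The three statistics can be computed side by side. Here is a small sketch (the field names and toy data are mine, not from the talk): statistical parity compares positive rates, equalized odds compares error rates conditioned on the truth, and calibration/PPV conditions the other way, on the prediction.

```python
def group_stats(records):
    """records: iterable of (group, y_true, y_pred) with 0/1 labels.
    Returns per-group positive rate, false positive rate, and positive
    predictive value."""
    out = {}
    for g in {r[0] for r in records}:
        rows = [(y, yh) for gg, y, yh in records if gg == g]
        negs = [yh for y, yh in rows if y == 0]   # truly negative individuals
        hits = [y for y, yh in rows if yh == 1]   # predicted-positive individuals
        out[g] = {
            'pos_rate': sum(yh for _, yh in rows) / len(rows),   # statistical parity
            'fpr': sum(negs) / len(negs) if negs else None,      # equalized odds side
            'ppv': sum(hits) / len(hits) if hits else None,      # calibration side
        }
    return out

data = [('A', 1, 1), ('A', 1, 1), ('A', 0, 1), ('A', 0, 0),
        ('B', 1, 1), ('B', 0, 0), ('B', 0, 0), ('B', 0, 1)]
stats = group_stats(data)
```

On this toy data the two groups differ on all three measures, which is exactly the situation where the definitions start pulling in different directions.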
So a famous argument happened over a program called COMPAS that was being used to predict the recidivism risk of criminal defendants, to decide whether they would be released or not. An investigative journalism organization called ProPublica published a critical, very well-researched study showing that COMPAS was biased, was not fair: if recidivism is labeled positive, it had a higher false positive rate for African-Americans and a lower false negative rate for Caucasians, so it was biased against African-Americans quite strongly. And COMPAS's response to ProPublica was to say: but it is calibrated, and they had the data to show that it was calibrated. (No, that's only true of the first definition, statistical parity: the rest of them assume there's a ground truth about which individuals in a group are positive examples and which are negative, and you had better get the ground truth more or less right in both groups. That's what the other definitions are saying, unlike this one here.)
Now, why is y fixed, why is y a given? Y is the ground truth: labeled examples. As in any machine learning, you assume that from past history we have some labeled examples; we know the outcome for some past cases. We let some people go and saw whether they committed crimes or not, along with their feature vectors, and we've also given some people loans and seen the outcomes. So history is our source of y's. Like any machine learning setting, there's a training data set of labeled examples, and the test data comes in the future. (I see, so on the training data... yes, fair enough: your labeled examples may be biased themselves, or you only have partial feedback in the way we described; for things like crimes, you typically only see the label if they were arrested and released.) Yeah. (You could critique the middle one here.) You're saying that one is more... yeah, but the second one is just talking about fitting: finding a hypothesis that is consistent with whatever observations you have so far. That's true, but you might question the observations themselves; I agree. And so ProPublica and COMPAS had this great argument, where one pointed out the flaw in COMPAS according to the second criterion, while COMPAS defended itself by the third criterion.
(So if you predict something to be a one, can you then see the label after you've predicted it to be one?) Yeah. The positive predictive value says: if you actually give the loan, or do whatever allows you to observe that person, do they do what they're supposed to do with the right probability? Yes, absolutely right. So again, you're using historical data gathered before this algorithm was put in place, and there are such labeled pairs available; that's the model. Or, even if this system is being used in some places, in other places it's not, so we are continuously gathering labeled data about people that way. Okay. Unfortunately, there's no way to resolve this dispute between COMPAS and ProPublica, because a couple of papers proved that it's impossible to achieve equal opportunity and equal positive predictive values simultaneously unless you are in one of two extreme situations: either the base rates of the two populations are exactly the same, or you have a perfect classifier, with no errors, and everything is good. Outside those ideal situations you cannot have both, so there's no resolution to this controversy.
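The arithmetic behind this impossibility can be checked directly. From the definition PPV = p·TPR / (p·TPR + (1−p)·FPR), where p is the group's base rate, one can solve for FPR (an identity along the lines of Chouldechova's analysis): holding PPV and TPR equal across groups with different base rates then forces different FPRs, and only equal base rates or a perfect classifier escape.

```python
def implied_fpr(p, tpr, ppv):
    """Solve PPV = p*TPR / (p*TPR + (1-p)*FPR) for FPR, given base rate p."""
    return (p / (1 - p)) * tpr * (1 - ppv) / ppv

# Same PPV and TPR, different base rates => different false positive rates:
fpr_low = implied_fpr(0.3, 0.8, 0.7)    # base rate 30%
fpr_high = implied_fpr(0.5, 0.8, 0.7)   # base rate 50%

# A perfect classifier is the other escape hatch: FPR = 0 for any base rate.
perfect = implied_fpr(0.3, 1.0, 1.0)
```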
Right, that's an excellent question, and several of us are working on exactly that: how can we relax this and get approximate notions of fairness for approximately good classifiers, and what is the relationship to the approximation ratio? But there are no results on that yet. So that's the background on fairness.
Let me briefly mention next what multi-armed bandits are. Many of you may know this, but I'm just going to assume not. The basic model is the stochastic bandits case, which is the simplest model. You have K arms; each arm has a reward distribution D_i whose mean is mu_i, so if you pull that arm you get a reward; think of these as casino slot-machine arms. At each time t = 1 through T, whoever is deciding which arm to pull pulls arm i_t and gets a reward drawn from that arm's distribution D_{i_t}. The goal of all these bandit algorithms is to minimize what's called the regret, or pseudo-regret: if you had known which arm had the best expectation and pulled it all the time, you would have had an expected reward of T times the maximum of the mu_i; instead, in expectation, your algorithm got the sum over t of mu_{i_t}. The difference is your regret, how much you left on the table, and the goal is to minimize it. There are nice algorithms that minimize the regret; let me just go through one quickly. The decision-maker simply maintains the sample means of all the arms pulled so far: they pull each arm some number of times and maintain the sample means. The decision-maker has to balance exploration against exploitation: exploration is trying arms that haven't been pulled that much; exploitation is pulling the arm that has produced the best sample mean so far. You've got to find a balance, and there's something called the upper confidence bound algorithm, UCB, which balances these two things nicely and achieves O(log T) regret in this case. What does UCB do? It maintains the sample means (the blue dots in the picture represent the sample means of three arms), but it also maintains a confidence interval, whose width depends on how many times you've pulled that arm: the more times you pull it, the narrower the interval, because you get more and more sure the sample mean is close to the true mean. So it maintains an interval within which the true mean of that arm lies with very high probability, and then it just pulls the arm with the highest upper confidence bound, the highest right end of its interval. In this picture, the second arm has the highest right end, so it pulls that arm, and this achieves O(log T) regret. So that's the UCB algorithm.
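Here's a minimal UCB1 sketch matching that description: Bernoulli arms, one initial pull of each arm, and then always the arm with the highest upper confidence bound. The confidence-width constant is one standard choice, not necessarily the talk's.

```python
import math
import random

def ucb1(means, horizon, seed=0):
    """Stochastic bandit with Bernoulli arms.  Maintains sample means and
    confidence widths, pulls the arm with the highest upper confidence
    bound, and tracks pseudo-regret against the best arm's mean."""
    rng = random.Random(seed)
    k = len(means)
    counts, sums = [0] * k, [0.0] * k
    regret = 0.0
    for t in range(1, horizon + 1):
        if t <= k:
            arm = t - 1                    # initialization: one pull of each arm
        else:
            arm = max(range(k), key=lambda i:
                      sums[i] / counts[i]
                      + math.sqrt(2 * math.log(t) / counts[i]))
        counts[arm] += 1
        sums[arm] += 1.0 if rng.random() < means[arm] else 0.0
        regret += max(means) - means[arm]  # pseudo-regret increment
    return counts, regret
```

On a three-armed instance, the pulls concentrate on the best arm, and the regret grows only logarithmically with the horizon.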
Why do bandit problems arise in fairness? You can imagine the K arms as K different groups of people, each with its own distribution, say in a bank loan situation, each with its own distribution of ability to repay, and you want to be pulling the arm corresponding to the group with the greatest ability to repay. Here, don't think of a group as one particular demographic group, because that sounds like you want to give loans only to one group and not to others; the decision can be influenced by the feature vectors of the persons involved as well. To make that clear, there is an extension of the stochastic problem to the linear bandit problem. You have K arms with unknown parameter vectors beta_1 through beta_K, and the usual number of rounds T. In each round, K contexts arrive, x_{1,t} through x_{K,t}. A context might be a vector of two values, SAT score and number of credit cards, for example; the arms might correspond to different groups, so the green group and the orange group might be different arms. Each arm weights these context features differently and gives you a reward whose expectation is the inner product of the context vector and the arm's parameter vector; the realized reward is drawn from a distribution around that expectation, and the algorithm only observes the reward for the arm it chooses. Again we measure performance by regret: look at the best you could have done for each context that arrived at each time, and subtract what you actually got. So that's multi-armed bandits, briefly; we'll get back to them when we talk about the particular problems.
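The protocol just described can be written down directly. This sketch uses my own notation and a stand-in policy rather than a real bandit learner; it just shows where the inner-product reward and the per-round regret come from.

```python
import random

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def run_linear_bandit(betas, rounds, policy, noise=0.1, seed=0):
    """Linear bandit protocol: each round one context per arm arrives, the
    learner picks an arm a, the expected reward is <x_a, beta_a>, and
    regret is measured against the best arm for that round's contexts."""
    rng = random.Random(seed)
    regret = 0.0
    for contexts in rounds:                      # contexts[a] is arm a's context
        exp = [dot(x, b) for x, b in zip(contexts, betas)]
        a = policy(contexts)
        observed = exp[a] + rng.gauss(0, noise)  # only this reward is seen;
        _ = observed                             # a real learner would update on it
        regret += max(exp) - exp[a]
    return regret

betas = [(1.0, 0.0), (0.0, 1.0)]                 # hidden arm parameters
rounds = [((0.2, 0.0), (0.0, 0.8))] * 10         # same contexts every round
```

With these contexts, arm 1's expected reward (0.8) beats arm 0's (0.2), so a policy that always picks arm 1 incurs zero regret and one that always picks arm 0 pays 0.6 per round.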
So let me get on to the three results, and talk first about unfairness and free-riding. It's a slight variation on the multi-armed bandit setting. What is the situation here? The learning that's happening is spread across many agents: it's not one individual deciding which arm to pull, but several people figuring out what the best arms are. For example, think about pharmaceutical research, where companies are trying to figure out drugs, and using the research of another company, the exploration that another company has done, could save you a lot of money. This is the kind of shared-exploration situation we're talking about, though pharmaceutical companies protect their IP very seriously, so that example is probably not realistic. The question we ask here is: by free-riding on other people's exploration, can you lower your regret? Basically, can you let other people suffer the regret while you get a freebie? A more realistic, though less important, example is Yelp evaluations of restaurants: you're exploring different restaurants in the city and you want to find the best one. A slight difference from the standard bandit setting: now each individual diner is pulling an arm, and pulling an arm corresponds to choosing a restaurant; each restaurant is an arm. At each round each diner pulls an arm, and the information that's generally available about diners is their context (I'll get to the context in a minute), how many times they've pulled each arm, and what rewards they've gotten. Collective exploration can spread the regret around, but consider the very simple stochastic case: each restaurant has an expected reward, meaning how good the dining experience there is; you have n diners and K arms, and some of the players are public about what they're finding, so we get data about each public player's actions and rewards. For simplicity in the talk I'll just assume one public player, and there's a free-riding private player who doesn't share his data. Suppose the free-riding private player plays the greedy strategy: at all times, play the arm that has been played most often by the public player. The question is: can we guarantee the free-riding private player then has o(log T) regret? This is true, but it's not enough for the public player to have small expected regret; in fact, we need a stronger condition: the chance that the public player's regret is linear must drop polynomially, with an exponent w that needs to be at least two. If that holds, and an alpha-UCB public player achieves it, then the free rider can get constant regret by always pulling the arm pulled most often by the alpha-UCB player. That's what we can prove in the classic case where there are no contexts. (That's an excellent question: in this case, because of the nature of the result, you only need to see the number of arm pulls, not the rewards; but that won't be true in general when we generalize.) That's right: assuming there are only two of you and the other player is playing UCB, you know that player has gone to the best restaurant most often; there are very few times they've gone to a worse restaurant, so you just go to the restaurant with the most visits. It doesn't matter what the average review is; you just go to the one that got the most visits.
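A quick simulation of the no-context case (parameters invented for illustration): the public player runs UCB1 and exposes only its pull counts; the free rider copies the most-pulled arm and ends up with far less regret, since after a short initial phase the most-pulled arm is the best one.

```python
import math
import random

def free_riding_demo(means, horizon, seed=0):
    """Sketch of the stochastic free-riding setup: a public player runs
    UCB1 and reveals only its pull counts; the private free rider always
    copies the arm the public player has pulled most often so far."""
    rng = random.Random(seed)
    k = len(means)
    counts, sums = [0] * k, [0.0] * k
    best = max(means)
    pub_regret = free_regret = 0.0
    for t in range(1, horizon + 1):
        if t <= k:
            arm = t - 1                    # public player's initialization pulls
        else:
            arm = max(range(k), key=lambda i:
                      sums[i] / counts[i]
                      + math.sqrt(2 * math.log(t) / counts[i]))
        # The free rider needs only the public counts, not the rewards:
        copy = max(range(k), key=lambda i: counts[i])
        counts[arm] += 1
        sums[arm] += 1.0 if rng.random() < means[arm] else 0.0
        pub_regret += best - means[arm]
        free_regret += best - means[copy]
    return pub_regret, free_regret
```

The public player keeps paying for occasional exploration (its regret grows like log T), while the free rider only pays during the brief period before the best arm takes the lead in pull counts.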
interesting cases when each arm each
diner on puller has a context vector
what is this context vector mean it
might be their preference for vegan food
or quiet restaurant these might be the
features of the context vector or you
know local produce or whatever I mean
just imagine all the features you could
have about restaurants so and the
restaurant so the free rider has a
context vector x1 let's say the public
players UCB players of context X through
X - 3 X 10 and each restaurant has a
parameter vector that talks about its
quality on each of the dimensions of the
quanta like how vegan-friendly is that
or how quiet is it or how whatever it is
and so on those would be the parameters
of the restaurant and the expected
reward when you go to a restaurant is a
linear combination and so the inner
product of your context vector with the
parameter vector of the restaurant and
So basically, the main result we show is that if the free rider's context vector is a linear combination of all the other diners' context vectors, and you take the smallest such linear combination (this vector c of coefficients has its norm in the denominator, so if it's large, that's a problem), then, roughly speaking, if the free rider is in the cone of the other diners' context vectors, the free rider can achieve constant regret by using the observations of the other diners. OK.
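The reason the linear-combination condition helps is the identity <x_free, theta_a> = sum_i c_i <x_i, theta_a>: the free rider's expected reward at any restaurant is the same weighted combination of the other diners' expected rewards there. Here is a small sketch of that estimation step; the helper names and the tiny normal-equations solver are my own illustration, and I feed in noiseless average rewards to keep it short.

```python
def solve_coeffs(others, target):
    """Least-squares solve for c with sum_i c_i * others[i] = target,
    via normal equations and tiny Gaussian elimination (illustration only)."""
    n = len(others)
    # Normal equations G c = b: G[i][j] = <x_i, x_j>, b[i] = <x_i, target>
    G = [[sum(u * v for u, v in zip(others[i], others[j])) for j in range(n)]
         for i in range(n)]
    b = [sum(u * v for u, v in zip(others[i], target)) for i in range(n)]
    for col in range(n):                      # elimination with partial pivoting
        piv = max(range(col, n), key=lambda r: abs(G[r][col]))
        G[col], G[piv] = G[piv], G[col]
        b[col], b[piv] = b[piv], b[col]
        for r in range(col + 1, n):
            f = G[r][col] / G[col][col]
            for k in range(col, n):
                G[r][k] -= f * G[col][k]
            b[r] -= f * b[col]
    c = [0.0] * n
    for r in range(n - 1, -1, -1):            # back substitution
        c[r] = (b[r] - sum(G[r][k] * c[k] for k in range(r + 1, n))) / G[r][r]
    return c

def estimate_free_rewards(others, target, avg_rewards):
    """avg_rewards[i][a] is diner i's average observed reward at restaurant a,
    which in expectation equals <x_i, theta_a>; so the free rider's expected
    reward at restaurant a is sum_i c_i * avg_rewards[i][a]."""
    c = solve_coeffs(others, target)
    K = len(avg_rewards[0])
    return [sum(c[i] * avg_rewards[i][a] for i in range(len(others)))
            for a in range(K)]
```

In the real algorithm the averages are noisy, which is exactly why the lower bound on everyone's visit counts matters: it guarantees each average has enough samples behind it.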
To prove this, what we needed to prove was that the other diners, even though they each have a favorite restaurant and are optimizing their own regret, will visit the other restaurants reasonably often. So there's a lower bound on the number of times that any other diner will go to any restaurant other than their optimum, and that lower bound is necessary for the free rider to gather enough data about the restaurant that may end up being its best. So we need a lower bound on how often the other people go to the other restaurants.
And in fact, what we can show is such a lower bound, and the proof is nice: we show a stochastic-bandit lower bound for the other players. Imagine one of the other diners; we want to show that in the stochastic bandit setting they will visit every restaurant at least so often, even though it may not be their best. And why is it stochastic? Because once you fix the other diner's context, all the rewards become like a stochastic bandit: you can take the inner product of the parameter vector and the context vector beforehand, so it's just a stochastic bandit problem that each diner is solving, and we just have to argue that in such a setting these other diners will visit all the other restaurants as well.
And the proof is an inductive proof that shows there are times beyond which the number of times that diner has visited every arm is at least this much: if t is greater than some t_K, then the number of times the diner has visited each of the K restaurants by time t minus 1 is at least logarithmic in t, with a K squared in the denominator. So as the number of arms grows, the lower bound degrades a little, but you can prove a lower bound that is logarithmic in the time horizon. OK.
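In symbols, the shape of that inductive lower bound is roughly the following, where n_a(t) is the number of times the diner has pulled arm a by time t; the constant c and the threshold time t_K are schematic, since the talk doesn't give them explicitly:

```latex
\exists\, t_K \text{ such that } \forall t \ge t_K,\ \forall a \in [K]:\quad
n_a(t-1) \;\ge\; \frac{c \log t}{K^2}.
```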
It's just what I said earlier: from the lower bounds we can argue that the free rider gets enough samples from all the other players at all the restaurants to be able to judge their own value for each restaurant. In this setting we need both the number of arm pulls by the other players and the rewards they experienced (we cannot do it without the rewards), and we need to know their contexts as well. OK.
On to the second result: pipeline decisions, where fairness becomes rather impossible; pretty much nothing can be done, especially if you're going to be exact. Again, approximation might be the solution. So here is an example of a pipeline decision. These are people taking the SAT exams; some of them get into university; all of those who get in graduate, let's say (let's keep it simple); then they all apply for jobs and some of them are hired, some of them get a job. So basically it's two steps in a pipeline: one is entrance to a university and the other is getting a job. OK.
That's where we are, and we have a super simple model. This model is completely unrealistic and you can shoot many holes in it, but the idea is that we're proving a negative result, and simplicity is not by itself a bad thing there; I'll leave that for later. So the scenario is the following. We have two populations, P1 and P2, and let's say every individual has something called their type. The type is a scalar; it's the only measure of quality, and somehow this magical number exists for every individual. Again, this is all highly simplified. Each population has its own distribution of types: D1 for population 1 and D2 for population 2. Students take a test, call it the SAT for simplicity, and the SAT (this is the big assumption) is a noisy but unbiased signal of the type. Basically, for an individual of type t, the SAT comes out t plus sigma, where sigma is, say, Gaussian-distributed noise. The college uses some monotone admission rule based on the SAT score; there's no other data, just that one number, the SAT score. The rule may be probabilistic or deterministic: a deterministic monotone admission rule would just say anybody with a score higher than this is admitted, while a probabilistic one would say if you have a score of this much, you have this much probability of being admitted, monotone increasing. And it may be different rules for different populations; we'll allow that. The admitted students get a grade, the GPA let's call it, and this is another noisy, unbiased signal of type; conditioned on type, it's independent of the SAT score. And the big thing about the model is that the college does not add any value: the input type is the same as the output type, and the college just adds some noise, it just signals your type. There are economics papers that make that assumption, and we know economists are always trustworthy, right? OK.
Then let's say we will assume a rational employer, purely utility-maximizing, who hires a graduate if there's a positive payoff for hiring that graduate; I'll explain what that means in a minute. So we assume explicitly that the employer is not worried about diversity and other criteria, which many employers are; we keep the model simple, and the employer is just looking for some positive payoff. OK.
So, the limitations of the model. One big limitation is that there's no correcting for past discrimination: we assume that the input types of the students are true representations of their quality, and that's all we want to reason about; that's a huge limitation. Tests and grades are assumed unbiased; that's a huge problem, and that assumption has been debunked a few times, for the SAT in particular, but we'll still assume it. Colleges don't add value, as I said, and employers only care about utility and not diversity, etc. So all of these are limiting assumptions, but again, we're going to show that even in this setting things are difficult. It's possible that if you make things more complicated, things actually become possible; I'm not saying the negative results will hold when you make a more realistic model, it's just harder to analyze. OK.
So what is the goal of affirmative action in this setting, you could ask, in this kind of two-step pipeline? We define two seemingly reasonable goals. One is equal opportunity, but end to end, meaning not just getting into college but actually getting a job out of college. So an individual of type t should have the same probability of being hired regardless of which population they come from: for individuals of the same quality, the probability of getting admitted to college and then getting a job should be the same for both. And let's assume there's no other way of getting a job except going through college; you cannot go around college and get the job. OK. And then irrelevance of group membership is another criterion: you might have an employer who has a particular type threshold. OK, sorry, I haven't said anything about that yet.
So what is a rational employer? The rational employer, based on observing what an individual has done (got into college, got a certain grade, and so on), is going to have some posterior distribution on the individual's type, and their behavior is just: look at the expectation of this posterior distribution. If it's higher than a certain threshold, which is what I call the desired zero-payoff threshold, a break-even point, then hire the individual; if it's lower than that, don't hire. That's how the employer is going to operate. So basically, irrelevance of group membership says that an employer with a desired type threshold should be able to ignore which group the individual is coming from; they know it, but it doesn't help them in that decision, and it's the same decision regardless of which group the individual comes from. And then strong irrelevance of group membership: after college, you may go to multiple employers who all have different thresholds, so you might want to make sure that the group is irrelevant for all employers. In fact, one way to assure that is that the type distribution of students admitted to college is the same regardless of group: the same type distribution for students admitted from group 1 as for students admitted from group 2. That would be called strong irrelevance of group membership. OK. So the employer behavior, again, as I've mentioned: estimate the posterior distribution of the student's type based on her group, admission to college, and grade; have a threshold T for the desired type of employee; and if the expectation of the posterior is bigger than T, then hire, otherwise don't hire.
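A rough Monte Carlo sketch of that employer rule, under the talk's Gaussian assumptions; all the distributions, constants, and function names here are my own illustration. For one group, simulate types, keep the admitted students, bin them by grade, and hire a student exactly when the empirical posterior mean of type in her grade bin clears the threshold.

```python
import random

def posterior_mean_by_grade(mu, tau, sat_sd, gpa_sd, beta, n=200_000, seed=1):
    """Estimate E[type | SAT > beta, grade bin] for one group by simulation.
    type ~ N(mu, tau^2); SAT = type + N(0, sat_sd^2); GPA = type + N(0, gpa_sd^2),
    independent given type. Returns {grade_bin: mean type of admitted students}."""
    rng = random.Random(seed)
    sums, counts = {}, {}
    for _ in range(n):
        t = rng.gauss(mu, tau)
        if t + rng.gauss(0, sat_sd) <= beta:
            continue                      # not admitted to college
        g = t + rng.gauss(0, gpa_sd)
        bin_ = round(g)                   # coarse grade bins
        sums[bin_] = sums.get(bin_, 0.0) + t
        counts[bin_] = counts.get(bin_, 0) + 1
    # keep only well-populated bins so the empirical means are stable
    return {b: sums[b] / counts[b] for b in sums if counts[b] >= 500}

def employer_hires(post_means, grade_bin, threshold):
    """Rational employer: hire iff the posterior mean of type clears the bar."""
    return post_means.get(grade_bin, float('-inf')) >= threshold
```

The simulated posterior mean rises with the grade, which is the monotonicity the proof sketch later relies on.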
So basically, we've shown that for group-blind hiring you need to set the admissions and grading policy so that this test is group-independent. What is a grading policy? Again, the grade is going to be an unbiased estimator of type, and the only thing the college can control is the variance of the grade. So the expected value is the type of the individual, we're assuming, and the only thing the college controls is the variance. As I mentioned, schools may not know what the employer threshold is; again, multiple employers might have multiple thresholds, and schools have to optimize for all of them. So then we can ask that independence of group membership holds for a range of thresholds, for different employers with different thresholds, and what we can show is that once you start requiring that this independence holds for multiple thresholds, you really have to have strong independence of group membership, meaning the admitted students should have the same posterior type distribution for both groups. OK.
Positive results are for very special cases; negative results are for all the realistic cases. The positive result says that if the SAT could be noise-free, meaning it truly shows exactly the type of the student, then we can have all the fairness goals we want; you'll see that in a minute. The other thing is that if colleges don't report grades at all (which some high-end business schools actually do; only the top business schools do that), then we can achieve both independence of group membership and equal opportunity by setting a very high threshold for admission to college. I'll go through this in a minute. And we can achieve just the limited goal of independence of group membership, meaning for one employer threshold, even if you have grades and a noisy SAT. The negative result is that in the more realistic case, noisy SATs and grades, no monotone rule can achieve strong independence of group membership, and equal opportunity is only possible by denying everybody. OK, that is an equal-opportunity solution, right? OK. So, the proof ideas.
Sorry, any questions? So, if you have a bunch of employers, and among them the maximum threshold for type is c-plus (let's say c-plus is the highest threshold that any employer has), then if the SAT is noiseless, meaning it truly shows your type, schools admit everyone whose score is greater than c-plus (meaning their type is better than c-plus), and the employer hires everyone the school admits. OK, and this is equal opportunity, because if your type is higher than c-plus, you'll get a score that's higher than c-plus, you will be admitted, and you'll be hired, independent of the population you belong to. And you can check that it also satisfies irrelevance of group membership. In the no-grades case, also, you can achieve both, because what you do is set the admission threshold so high that for every group, the posterior distribution of the type, given that the score of the individual is better than that threshold, has a mean which is better than c-plus. So again, you set your threshold very high, so you admit the best students who have high types, and then every admitted student is hired, and that achieves both objectives. OK.
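Here is a small sketch of that no-grades construction; the parameters are my own, and I'm assuming type is Gaussian with independent Gaussian SAT noise, as in the talk's setup. The posterior mean of an admitted student's type has a closed form (a truncated-Gaussian, inverse-Mills-ratio identity), and since it is continuous and increasing in the admission threshold beta, each group's threshold can be found by bisection.

```python
import math

def posterior_mean_admitted(mu, tau, sat_sd, beta):
    """E[type | type + noise > beta] for type ~ N(mu, tau^2) and
    noise ~ N(0, sat_sd^2): a standard truncated-Gaussian identity."""
    s = math.sqrt(tau ** 2 + sat_sd ** 2)
    a = (beta - mu) / s
    phi = math.exp(-a * a / 2) / math.sqrt(2 * math.pi)
    tail = 0.5 * math.erfc(a / math.sqrt(2))   # P(standard normal > a)
    return mu + (tau ** 2 / s) * (phi / tail)

def admission_threshold_for(mu, tau, sat_sd, c_plus, lo=-10.0, hi=20.0):
    """Bisect for the admission threshold beta at which the posterior mean of
    admitted students equals the employer break-even point c_plus. This works
    because the posterior mean is continuous and increasing in beta."""
    for _ in range(200):
        mid = (lo + hi) / 2
        if posterior_mean_admitted(mu, tau, sat_sd, mid) < c_plus:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2
```

Each group gets its own beta computed from its own type distribution; every admitted student then clears the employer's break-even point c-plus, so all admits are hired and both goals hold.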
OK, so deterministic monotone threshold rules. Maybe I'll just get to the results and stop there, because I'm mindful of the time. Basically, the sketch of the idea is that we look at what the posterior looks like: this big quantity here, the expectation of the type of an individual from population i, given that their score was better than the threshold and their grade was some g. We show some nice properties of this posterior: it's continuous, differentiable, and increasing in all of these parameters, and also, as the grade goes from minus infinity to infinity (if that's possible), the type expectation goes from minus infinity to infinity as well. So this is a nicely behaved posterior expectation function, and you can achieve independence of group membership for a particular threshold in the following manner. Suppose the threshold is c. What you do is find thresholds beta-1-star and beta-2-star for admission to college in such a manner that there exists the same grade at which this posterior expectation, as far as the employer is concerned, becomes c. How do you do that? You fix the threshold of admission for one group and you move the threshold of admission for the other group, and because of continuity, differentiability, and so on, there will be a setting at which they are equal at the same grade, and then you can get independence of group membership. So this is all elementary consequences. Strong independence of group membership is just not achievable, and that follows from arguments similar to the Kleinberg, Mullainathan, and Raghavan result. OK.
The third result: let me just say what it is without actually going into it. Perhaps Jamie explained this result in her job talk, because this is very much joint work with her; so let me not use the slides anymore and just talk about it briefly. There are situations where exploration seems unfair to the individual at hand. One example is medical trials. You have a bunch of different drugs that you could give to a patient, and these are represented by the arms, each with its own parameter vector; we don't know the efficacy of each medication, and we are learning these parameter vectors as individuals pass through the system. A new individual comes along, represented by a feature vector. One possibility is to run a UCB-type algorithm, sometimes giving them a treatment that may be suboptimal in expectation based on our current knowledge, just in the interests of science, so that we can learn better what the values of the treatments are. But that doesn't seem fair to the individual in front of us, so you might want to just give them the best treatment in expectation: no exploration, just exploit what we know and give them the best treatment based on what we know. And then the danger, of course, is that this can incur a lot of regret; this is well known in the bandit literature.
So that's one example. Another example is that a myopic agent, for example an Airbnb landlord, might not want to explore different groups that they don't know much about; they just want to rent to the people that they are aware of and have good data on, and so they might discriminate against other groups. So again, they are not exploring, for a different reason; those are two different reasons why exploration doesn't happen, and there's a greedy algorithm that comes into play in these situations. And we ask ourselves: is that always bad? It seems like in general it's bad, but what we show in this result is that if there is sufficient diversity in the data, in the following sense, then greedy is fine. An adversary chooses the feature vectors of the people coming through the system, but there is a random Gaussian perturbation of the vectors with some variance, and if the perturbation has sufficient variance, then in fact greedy will achieve basically optimal regret. So in fact it's OK to be greedy and still get the best solutions. So that's the result, which I'm not going to explain, but that's basically it.
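To make the "perturbation rescues greedy" phenomenon concrete, here is a toy simulation in the spirit of the result; the dimensions, constants, and the per-arm ridge estimator are my own choices, not the paper's. An adversary cycles through base contexts, Gaussian noise is added, and a purely greedy least-squares player (no exploration bonus at all) still drives its per-round regret down.

```python
import random

def greedy_linear_bandit(T=3000, noise_sd=0.5, seed=0):
    """Greedy, no-exploration play on a 2-d linear contextual bandit whose
    adversarial base contexts are smoothed by Gaussian perturbations."""
    rng = random.Random(seed)
    thetas = [(0.9, 0.1), (0.2, 0.8), (0.5, 0.5)]   # true arm parameters
    bases = [(1.0, 0.0), (0.0, 1.0), (0.7, 0.7)]    # adversary's base contexts
    lam = 1.0
    # Per-arm ridge statistics: A = lam*I + sum x x^T, b = sum reward * x
    A = [[[lam, 0.0], [0.0, lam]] for _ in thetas]
    b = [[0.0, 0.0] for _ in thetas]
    regrets = []
    for t in range(T):
        bx, by = bases[t % len(bases)]
        x = (bx + rng.gauss(0, noise_sd), by + rng.gauss(0, noise_sd))
        # Greedy: estimate theta_a = A_a^{-1} b_a (2x2 inverse), pick the best
        best, best_val = 0, float('-inf')
        for a in range(len(thetas)):
            (p, q), (r2, s) = A[a]
            det = p * s - q * r2
            th0 = (s * b[a][0] - q * b[a][1]) / det
            th1 = (-r2 * b[a][0] + p * b[a][1]) / det
            v = x[0] * th0 + x[1] * th1
            if v > best_val:
                best, best_val = a, v
        mean_reward = x[0] * thetas[best][0] + x[1] * thetas[best][1]
        reward = mean_reward + rng.gauss(0, 0.1)
        opt = max(x[0] * t0 + x[1] * t1 for t0, t1 in thetas)
        regrets.append(opt - mean_reward)       # expected per-round regret
        # Ridge updates for the pulled arm only
        A[best][0][0] += x[0] * x[0]; A[best][0][1] += x[0] * x[1]
        A[best][1][0] += x[1] * x[0]; A[best][1][1] += x[1] * x[1]
        b[best][0] += reward * x[0]; b[best][1] += reward * x[1]
    return regrets
```

Shrinking noise_sd toward zero removes the smoothing, and greedy can then lock onto a wrong arm; with enough variance, the randomness in the contexts is doing the exploration for free.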
Let me see if I have any conclusions. I don't think I have any conclusions, but yeah, this is what I was going to get more technical on; let's see. OK, let me just conclude by saying that fairness is a very loosely defined space with lots of possible definitions. It would be nice at least to consolidate some of these definitions and understand a smaller set of definitions that we need to work with to understand fairness more fully, but we are not at that point yet; we are just throwing out definitions left and right. Maybe eventually there will be some consolidation and some better understanding. We're in early days yet. OK, I'll stop here.
[Audience question: rather than assuming distributions for your types or data, could you try to be fair in the sense that you're always as fair as possible compared to, let's say, an adversarial future?] Right, right: the adversarial setting is another option; the regret bounds are worse, but that's fine. But yeah, unfortunately, in that setting it's probably not possible, because the geometry of the problem is such that the free rider's context vector is going in a certain direction, and the adversary can play against that. My instinct is that it's not possible, but I don't have a definite answer to that one. And then in the other bandit setting, this last one, if you just have a purely adversarial context you are going to be sunk; you need the noise in order to say that you don't have to explore. Yeah, absolutely. So maybe in the first one there's some hope, but I don't think so.
[Audience question: could these notions apply to something that's not human?] That's a good question. I think the answer is no; I'm going out on a limb and saying the only entities that we care to be fair to at the moment are humans. Animals, maybe; yeah, for animals it may be possible. Gerrymandering, I guess, is an example of a fairness question where you're drawing congressional districts: are we being fair to the people, or fair to the people's representatives? Ultimately, though, it sounds like we want to have fair districting, so there's probably some underlying human in all of these definitions that we're trying to be fair to, I think.
[Audience question, partly inaudible, about whether similarity could be defined from the algorithm's output rather than specified in the abstract.] I see. No, but I think that's not the idea, because if you use the output as the way of defining similarity, then by definition the algorithm will be fair in that sense, right? If the output of the algorithm is the measure of similarity, then fairness is built in. So the idea (and again, this is a question that's punted on in the paper) is to define some notion of similarity that seems correct for the classification task at hand. Maybe for bank loans your athletic ability should not play a role, for example; or maybe you're looking to see how it could. So you have to find the right data and the right notion of similarity, and then you could require that similar individuals are treated similarly. OK, sorry, go ahead. [Audience question: if you want to satisfy these axioms, is there an objective you're maximizing or minimizing?] No, there is no such thing yet; we're not at the point of a single axiomatic system or any sort of mainstream axiomatic treatment of fairness. We're not there. Yeah, that's exactly right. OK, thank you.