Thanks for coming, everyone. I see a lot of the computer architecture students here, and that's good: part of my goal here is both to get you interested in the sorts of things we care about, and also to reprimand you for doing a crappy job building machines. I'll try to do both. So let me start.
So, the thing we're interested in: for those who don't know me, our lab has been working on making codes really, really fast. And of course, it seems people don't care about speed anymore; what they actually care about is energy and power. Just as an example: I unplugged my phone at 10 a.m. this morning, it's 2 p.m. now, and the battery is already down to about a third of its capacity. That really irritates me. That's a big problem, but it's not just on these devices,
of course. The systems I care about are supercomputers, and the supercomputer we're projected to build in 2020, if we build it out of the components we have today, will cost somewhere on the order of four or five hundred million dollars a year just to turn on. That's before you've done anything useful: you haven't written any software, you haven't hired any staff, you've just installed the machine, and you probably haven't even cooled it at that cost. So power and energy are a huge, huge problem. And what we're wondering is: do power and energy have any implications for those of us who care about writing software and designing algorithms? I don't know the answer, and that's why I'm here. My students and I are here asking for your help, your input, and your feedback.
Since it's mostly students in the audience, I want to start with the 1985 ACM Doctoral Dissertation Award winner. This was Danny Hillis. How many of you have read this thesis? Those of you in computer architecture who haven't read it, especially if you're interested in parallel computing: you should be ashamed of yourselves. This is a fantastic thesis to read, and here is why you should read it. The concluding chapter of the thesis is called "New Computer Architectures and Their Relationship to Physics, or Why Computer Science Is No Good." OK.
So Danny Hillis was a computer architect. He designed the Connection Machine, one of the early, really major parallel computer architectures, and he was reprimanding the algorithms people. He was saying: you guys don't know what the heck you're doing; you're designing algorithms with completely unrealistic cost models. In particular, he was reacting to a model called PRAM. How many of you know what PRAM is? OK. So he was saying: you guys designing PRAM algorithms, you're crazy, because at the end of the day you have to take this algorithm and run it on a physical machine, and PRAM ignores all costs; in particular, it ignores communication costs. But you have to pay the cost of communication, so if you ignore it, not only are you going to get an algorithm with crappy performance, but you're also not helping the architects. Right? I'm not an architect; you guys are the architects. You want to help us build machines that are faster and more efficient in time and energy, and we build better software. But you can't do that if everything I run is not really designed for the machines you can build. So this was a call to the algorithmists to really rethink the way they thought about algorithms.
It's most inspiring; I think you should definitely go read it, especially when you're sitting down to write your own thesis. Mine, by comparison, was extremely boring; I wish I had thought to be as provocative. So maybe some of you will take some inspiration from this. OK. So in my lab there are three people who are going to finish their theses soon. What I'm going to do in this talk is tell you something about our collective thinking about time and energy, from the perspective of the projects they've been working on. These are all (well, except for Aparna's) relatively early-stage things, so there's an opportunity for some feedback and interaction. So I'll start
with Aparna; she's sort of the most senior. She had a chance to work on a team, led by George Biros, which won the 2010 Gordon Bell Prize. For those of you who don't know what that is, it's sort of the highest performance-demonstration award in the area of high-performance computing. This was for something that simulated blood flow, red blood cells, and Aparna worked on multicore scaling: there was an algorithm, and it scaled to the entire system pretty well, but single-node performance was terrible. She came in and she fixed that, and there was a lot of performance engineering involved.
After this performance-engineering exercise, she stepped back and said: whoa, so what did we learn from all this? (Aparna is sitting in the back; wave your hands so we can see you.) She stepped back and asked: did we actually learn anything? And she started thinking very hard, in an analytical way, about the fundamentals of the algorithm: how many operations does it do, how much concurrency is there, and how much communication does it do? She did this analysis for a particular algorithm called the fast multipole method (FMM), which is an optimal, linear-time approximation algorithm, with a guaranteed approximation, for N-body problems.
She analyzed it for a cartoon many-core architecture. This cartoon architecture has some number of cores, a large shared local memory (for the architects: think last-level cache), and then some slow DRAM. She looked at the algorithm and counted how many operations it does, how much parallelism there is, and how much data transfers back and forth between, say, main memory and this local fast memory where the work gets done. She wrote down some horribly complicated expression, and because I'm too stupid to understand long and complicated things, I boiled it down to its essence. Essentially, what she showed is that execution time is basically this little formula. Let me walk you through the formula and tell you what it says, because there's something very interesting and, I think, very deep in it.
The first factor is the "speed of light": if you didn't have to do any communication whatsoever, this is how fast the algorithm would run. And you can see the form of this formula is the speed of light plus some factor. What is this other factor? It's the communication penalty. In this case, it turns out it's basically just the ratio of the processor's peak flop rate (flops are floating-point operations per second) divided by the peak bandwidth of the communication channel. You'll notice that the size of the fast memory, Z, does not appear here; that's because it's a low-order term, and in simplifying I just dropped it. That's a hint for the architects: it says that for the FMM, increasing cache size doesn't actually help us. It's essentially irrelevant, given that caches today are likely already large enough.
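To make the shape of this result concrete, here is a small sketch of a time model of that form: a "speed of light" term times one plus a communication penalty, where the penalty is the machine balance B divided by the algorithm's intensity I. All numbers here are illustrative assumptions, not Aparna's actual constants.

```python
# Sketch of the simplified execution-time model described above.
# All constants below are hypothetical, for illustration only.

def exec_time(W, Q, F_peak, beta):
    """W flops and Q bytes moved on a machine with peak flop rate
    F_peak (flops/s) and peak bandwidth beta (bytes/s)."""
    t_light = W / F_peak          # "speed of light": compute time alone
    B = F_peak / beta             # machine balance (flops per byte)
    I = W / Q                     # algorithmic intensity (flops per byte)
    return t_light * (1.0 + B / I), t_light

# Machine balance doubles roughly every four years, so the penalty
# term grows even though the "speed of light" itself improves.
W, Q = 1e12, 1e10                             # hypothetical counts
T0, light0 = exec_time(W, Q, F_peak=1e12, beta=2e11)  # balance B = 5
T1, light1 = exec_time(W, Q, F_peak=4e12, beta=4e11)  # later: B = 10
penalty0 = T0 / light0 - 1.0
penalty1 = T1 / light1 - 1.0
```

Note that the fast-memory size Z never enters this expression, which is exactly the point about cache size.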
The penalty is very interesting: it's the balance ratio of the processor, the flop-to-byte ratio of the processor. The architecture people, I'm sure you're very familiar with this concept of a balanced processor. And the troubling thing about this ratio is that it doubles about every four years, so this communication penalty is getting larger and larger and larger. Even though the "speed of light" gets faster, because the processors get faster, the communication penalty grows with time. Today, it turns out, it doesn't matter: when we did this Gordon Bell run, it scaled to the full system and it was very, very fast, so it seems fine. But we're going to be in trouble if this trend continues. So one question for the architects is: is there anything we can do about this trend, to change it in some fundamental way? There are lots of ideas, things like stacked memory and so on, but do they really address the problem? Do they make it possible for this ratio to stay essentially constant over time? So that's one problem, the processor balance ratio.
On the algorithm side, is there anything we can do? It turns out there is, and it's kind of interesting what you can do to compensate for this growing balance ratio: drop the accuracy. Basically, according to this expression, the main knob for compensating for growing imbalance over time is to drop bits, as many bits as you can. So it's time for us as algorithmists to stop and ask: how many bits do I really, absolutely need? Is there anything I can do to get rid of some? In some cases the answer will be nothing; in other cases, if this is, say, an astrophysics simulation, you'll probably be happy with one digit of accuracy, and you can basically drop all the other bits. OK, so that's sort of the prognosis. And this is our first cartoon, our first hint, that if we go back to algorithmic first principles, we might be able to tie them directly to architectural parameters, and then we have something interesting to talk about, those of us who do algorithms and those of us who do architecture.
OK, so that's sort of the goal. Let me leave you with one other little analysis that Aparna has done, which asks how much time is spent doing flops versus communication. It turns out that today, all the time is spent doing flops, and that's good; it means the method will scale. But sometime between 2015 and 2020 it will stop scaling; basically, it will be limited by communication. People have been studying the FMM for a long time and were very surprised to hear this; they had just sort of assumed that the FMM would be scalable for a long, long time to come. But the window is closing, unless you guys do something about it. OK, so that's the first part. Any questions at this point?
[Audience question.] That's right. That's right, thank you. Probably not; it depends. The imbalance means the crossover will grow over time, but maybe it doesn't grow too fast. And if the data is really huge, then yeah: I don't expect an N-squared algorithm to beat an N log N one, no matter what the communication costs are.
That's right; that's exactly right. In fact, that's one of the fundamental principles of parallel computing: you try to come up with parallel algorithms that are work-optimal, so they don't do asymptotically more work than the best sequential algorithm. And that's exactly our starting point here. So you can't beat linear time on this one; well, you can if you throw away data, I guess. OK, good. Other questions?
[Audience question.] Right, so that analysis is characterized by the ratio of, essentially, the time to do flops and the time to communicate. So as you go to a processor with lower performance, that ratio will tend to get smaller, the communication penalty will tend to decrease, and that means you'll tend to shift this crossover point further into the future. That's right, and that's actually a good thing if you need more than one processor. If you only need one processor, it's a terrible thing, because then you only care about how fast it is. So there's some negotiation there. But yes, if we're thinking about designing a supercomputer out of low-power parts, it would still be nice if they were high-performance; there will be sequential parts of the code, and so on and so forth. But balance is the key, according to this analysis.
OK, other questions? These are good. [Question.] Until you can write down the communication costs, it's impossible to say. Right. [Question.] Yeah, and that would be great. Actually, thank you for reminding me: there's something very important in this previous slide, which is that this accuracy parameter enters superlinearly. So even if there's a linear increase in imbalance, a relatively small decrease in accuracy can compensate for it, because of that exponent. In a case like machine learning, or astrophysics, maybe you only need one digit; you can drop lots of bits, and there are big wins there. So, yes.
OK. So this is all just by way of background motivation, but the title promises to say something about the relationship between time and energy; that's kind of the interesting new thing, so let's see if we can say anything concrete about it. Now I'm going to switch. Aparna is about to graduate, so we'll stop talking about her work (she can do that on her own) and start talking about the next person.
So now I'm going to talk about what Jee has been thinking about, which is analogous to the roofline model for performance analysis. How many of you know what the roofline model is? If you don't, I'll basically walk you through it. Nobody? Nobody knows it? That's too bad. All right, let's go back to the cartoon. The cartoon is: I have a processor, a small fast local memory, and a large slow memory. Let's suppose I can write down an algorithm and count how many local operations it does and how many words of communication it moves. We'll call these flops, on the processor, and mops, memory operations, in between. So what's the execution time? We need some costs, so let's suppose we know the cost of doing a flop and the cost of doing a mop. Then we can calculate the total flop time and the total mop time, and if everything is happening concurrently, we basically pay the larger of the two: if we ideally schedule and overlap everything, the total time will just be the max of the component times.
So far, nothing controversial. Good. Now do a little algebra: as in Aparna's analysis, I'm going to pull out this speed-of-light factor, and again you can see there is a performance-penalty factor.
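As a sanity check on that algebra, here is a tiny sketch, with made-up per-operation costs, showing that max(W * t_flop, Q * t_mop) factors into the speed-of-light term times max(1, B_T / I):

```python
# Time model: flops and mops fully overlap, so we pay the max.
# t_f and t_m are hypothetical per-operation costs, not real numbers.

def time_direct(W, Q, t_f, t_m):
    """Total time: max of total flop time and total mop time."""
    return max(W * t_f, Q * t_m)

def time_factored(W, Q, t_f, t_m):
    """Same quantity, factored as speed-of-light times penalty."""
    I = W / Q            # algorithmic intensity (flops per byte)
    B_T = t_m / t_f      # machine time-balance (flops per byte)
    return (W * t_f) * max(1.0, B_T / I)
```

The two forms agree whether the code is compute-bound or memory-bound; the factored form just makes the ratio of ratios explicit.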
And this factor is very interesting: there are two ratios in it. The first ratio, the blue part, is Q over W. That's one over what we'll call intensity, the inherent flop-to-byte ratio of the algorithm; this is a property of the algorithm alone. The red factor is a property of the machine: it's again that balance ratio we saw in Aparna's analysis. So it's the ratio of these two ratios that tells us the magnitude of the penalty. If it's greater than one, we'll be limited by communication; otherwise we'll be dominated by flops, and that's what we want. So this B, this is the balance ratio. All right, so that's the story for time. What's different about energy?
What's the biggest difference for energy, other than the cost units? By energy here I really mean joules: joules per operation, or joules per byte. What's the big difference? Right: the sum, not the max. I can't overlap energy; I have to pay for all of it. It seems like a small difference, but it has big consequences. So here I've just put down the sum, and I've factored it in the same way; it has the same functional form, it's just a sum instead of a max. Now I'm going to plot these, and I'm going to do something a little bit funny: I'm actually going to plot the inverses, performance (one over time) and one over energy, normalized by the speed of light and by the maximum energy efficiency, respectively. When I do that, I get two curves. The top line is the performance line, the one corresponding to time, and it looks like a roofline; some people call this the roofline model (this is Williams, Waterman, and Patterson). The blue line is the analogous line for energy, and of course it's smooth, because we're summing: you trade that sharp inflection for a smooth curve.
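In normalized form, the two curves are one-liners. This is a sketch (B_T and B_E stand for whatever machine balances you plug in): the roofline for time has a sharp inflection at B_T, while the energy curve is smooth and only approaches one asymptotically.

```python
def perf_norm(I, B_T):
    """Normalized performance: T_light / T = 1 / max(1, B_T / I).
    Hits exactly 1.0 once intensity I reaches the time balance B_T."""
    return 1.0 / max(1.0, B_T / I)

def energy_eff_norm(I, B_E):
    """Normalized energy efficiency: E_min / E = 1 / (1 + B_E / I).
    Smooth everywhere; equals 0.5 at I = B_E, never reaches 1.0."""
    return 1.0 / (1.0 + B_E / I)
```

The sum-versus-max difference is visible directly: perf_norm saturates, energy_eff_norm never does.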
To get some actual curves, we went and read a paper from the people at NVIDIA, pulled out their flop-to-byte ratio in time and their flop-to-byte ratio in energy, and used those to plot the curves. These are the curves you get. The balance point with respect to time is 3.6 flops per byte; the balance point with respect to energy is 14.4 flops per byte, about a factor of four bigger.
OK, so what does this mean? Does it say anything interesting? For the algorithms people in the audience: you write down your algorithm, you analyze it, and that puts you at some point on the x-axis. If you want to know how fast you could run if you tuned everything and wrote the optimal code, you go straight up to the curves, for a fixed algorithm. So an intensity of, say, sixteen would put you somewhere on the red line and somewhere on the blue line, at the same x value. So what does this tell us? It says a bunch of really interesting things. In particular, look at this region in the middle, the region between B_T and B_E. What the heck is this? This is kind of a funny zone. The y-axis is log base two, and you can see where one-half is: half means you're within half of the best possible, so either I'm at half the best performance or I'm at half the best energy efficiency. I think of these as the crossover points: on this side, communication dominates time and energy; on that side, flops dominate time and energy; and you want to be to the right if at all possible. This zone in the middle is funny. It says that if I'm somewhere in here, I could be running as fast as possible (at the top of the red line) but still be under the fifty-percent energy-efficiency mark. So there are two notions of compute- and memory-boundedness: I can be compute-bound in time but memory-bound in energy. And this is where I should expect to see some funny behavior: programs that are very, very fast, but that use a lot of energy, maybe more than we would like, essentially because of communication. So it's a funny little zone. It says that if there's a gap, and B_E is bigger than B_T, then we might be compute-bound in time but memory-bound in energy.
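Using the two balance points quoted for this example (3.6 and 14.4 flops per byte), here is a sketch of how you might classify an algorithm's intensity into the three zones just described:

```python
# Balance points quoted in the talk for the GPU example (flops/byte).
B_T, B_E = 3.6, 14.4

def zone(I):
    """Classify an algorithm's intensity I against the two balance points."""
    if I < B_T:
        return "memory-bound in time and energy"
    if I < B_E:
        return "compute-bound in time, memory-bound in energy"
    return "compute-bound in time and energy"
```

The middle case is exactly the funny zone: fast code that is nonetheless energy-inefficient because of communication.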
Note that this abstracts the algorithm only by its intensity, on the x-axis, and for a particular architecture we draw these two lines. OK, so is this clear so far?
OK, all right. So one thing is this funny gap. The other thing is this: the fact that B_E is larger than B_T says it's harder to optimize for energy than it is for time. Optimizing means I have a code that sits somewhere on this axis; when I measure its actual performance, I'll be somewhere down here, and if I performance-tune it, I'll hopefully hit the curve. When I change the algorithm, I'm going to try to increase the intensity without increasing the work. I can do that for time and get to a certain point where I'm time-optimal, but then I have to keep working in order to be energy-optimal. So if it's true that on real systems B_E is greater than B_T, then this says that optimizing for energy is harder than optimizing for time. [Question.] Not yet; well, we have guidelines for increasing intensity. Basically, it's reducing communication; that's basically it. And maybe reducing accuracy, as we saw in Aparna's example, would be another way. So far this doesn't tell us anything about what to do; it just tells us the direction we need to go in.
OK. Let's see, here's the other thing it says that I think is very interesting. Suppose you're able to hit one of these lines, either the roofline or, for energy, what I'll call the "arch line." If I have an algorithm and I manage to get it on the good side of the energy line, and I manage to hit the curve, then it's very likely that I'm also time-efficient, because B_E is greater than B_T. It's the corollary to what we just saw on the previous slide. So, how many of you have heard of the energy-optimization strategy called "race to halt"? What is race to halt? The theory behind race to halt is: if you want to save energy, just run as fast as possible, then shut everything off; when everything is off, you're not burning any energy. This analysis says that may not be enough. If race to halt works, then maybe it's because these balance points actually coincide. That's sort of a corollary to race to halt. [Why?] Because I could be in this funny zone. Yes.
OK. So right now this analysis is still very simple and still very abstract, but already I think we're starting to see the makings of some interesting parameterizations. For the architects, I think an interesting question is: these B's, what the heck are these values, and what are the trends in them? How do these values change over time? We saw how B_T changes over time; I said it doubles every four years. What about B_E? Are these going to meet, or are they going to cross over? If they switch, then I as an algorithm designer need to focus on optimizing time, and I'll get energy for free. But if they stay like this, and the gap grows, then I need to stop optimizing for time and in fact optimize for energy. So what should we do? I don't know. Architects, what should I do? What will B_T and B_E do? Any thoughts? Any guesses? It's a democracy; should we vote? OK.
[Question.] Sort of; we'll get to that. Right now we're just talking about a model, and it's the cartoon model where we take the time model and just translate it into an energy model. I'm assuming these constants are real and can be accurately estimated in some way. With respect to this model, we haven't done that yet, but I will show you some microbenchmarking data in a moment to help convince you that this is real. OK.
Let's see. All right, the other thing I can do is ask about power. This was about time and energy; power is energy divided by time, so I can basically just divide the two curves, and you can ask what that gives you. It gives you another curve, which I'll call the "power line," because it looks like power lines. And again, these balance points will be critical points somewhere in this curve that say something about the behavior. In this case, where B_E is greater than B_T, it says that when you're communication-bound, you're also going to need a lot of energy, and you're going to have to increase power in order to go faster; but once you become compute-bound in time, you'll start reducing the overall power consumption. So if I want to reduce both power and energy, then as an algorithms person I need to move this way, to the right.
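Dividing the two cartoon models gives the power line. In this sketch, with hypothetical constants, average power rises while you are communication-bound, peaks at the time balance point B_T, and falls once you are compute-bound:

```python
def avg_power(I, t_f, e_f, e_m, B_T):
    """Average power = energy per flop divided by time per flop,
    under the sum-energy and max-time cartoon models (no static term)."""
    energy_per_flop = e_f + e_m / I              # sum model
    time_per_flop = t_f * max(1.0, B_T / I)      # roofline (max) model
    return energy_per_flop / time_per_flop

# Hypothetical constants, chosen only to show the shape of the curve.
t_f, e_f, e_m = 1e-9, 1e-9, 1e-8
B_T = 4.0
p_low  = avg_power(B_T / 2, t_f, e_f, e_m, B_T)   # communication-bound
p_peak = avg_power(B_T,     t_f, e_f, e_m, B_T)   # at the balance point
p_high = avg_power(B_T * 4, t_f, e_f, e_m, B_T)   # compute-bound
```

So under this model, moving to the right of B_T reduces both the communication penalty and the average power draw.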
OK. So that's just power; let's leave it there for additional discussion for the moment.
OK, all right. So far the algorithm is still abstract; we want to say something maybe a little bit more concrete. Here's just an example of something we've been thinking about doing an analysis for, and I don't know where this will go yet. One class of interesting computations are those that exhibit a work-communication tradeoff. So maybe I have a baseline algorithm, algorithm one, and I can give you a new one, where the tradeoff is that the new one has lower communication at the cost of more flops. So: multiply the flops by a factor f, with f greater than one, and divide the communication Q by a factor m, with m greater than one. If I have such a work-communication tradeoff, then what happens? We can ask: do I get a speedup? Or do I get an improvement in energy efficiency? I'll call that a "greenup"; I think we made up this term, I haven't seen anyone else use it.
And we can do some analysis on this; I'm not going to show you the details. For example, suppose all you care about is getting a greenup. What is the condition on f and m that guarantees you'll get a greenup? If you just calculate it (it'll take you like three seconds to work out yourselves), you get "one plus something or other," and this expression has some structure. For example, it says that if I reduce communication completely, so m goes to infinity, I can do more flops, but I can't do too many more: the number of flops I can do will be bounded by roughly one plus the balance ratio over whatever my baseline intensity was. And if you just stare at it, you can convince yourself that it makes sense. All right.
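That three-second calculation looks like this under the simple sum-energy model with no static power, where the tradeoff multiplies flops by f and divides mops by m. The bound 1 + (B_E / I) * (1 - 1/m) is what falls out of that model; this is a sketch, and the exact expression on the slide may differ in detail.

```python
def greenup(f, m, I, B_E):
    """Energy(baseline) / Energy(new) under E proportional to f + B_E/(I*m).
    Baseline has f = m = 1, so its energy is proportional to 1 + B_E/I."""
    base = 1.0 + B_E / I
    new = f + B_E / (I * m)
    return base / new

def f_max(m, I, B_E):
    """Largest flop-inflation factor f that still gives greenup > 1."""
    return 1.0 + (B_E / I) * (1.0 - 1.0 / m)

# With the talk's GPU numbers, a baseline at the time balance point
# (I = 3.6, B_E = 14.4): as m grows, f_max approaches 1 + 14.4/3.6 = 5.
```

That recovers the "about five times as many flops" figure mentioned below for a baseline sitting at the roofline inflection point.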
Yeah, there are many examples where this is possible. The FMM, for example, which Aparna has been working on: there are ways to change its structure to reduce communication and increase computation. Stencils: you can do this there too. It comes in discrete steps, but there are many classes of algorithms where this is possible.
OK. So, for example, it says that if my baseline algorithm is right at the inflection point of the roofline, then if I plug into the formula, I can't do more than about one plus this ratio, or five times as many flops. So this is just analysis; I don't know where we're going with this. I guess we should test it, probably. But you can imagine playing these games. There's more here, but I'm going to skip this part: you can ask when you get a speedup, when you get a greenup, when you get both, and when you get neither.
You can ask that and sort of find zones. These points are not measured data; they're just different values of f and m plugged in, and you can derive bounds on all these things. So anyway, that's the work-communication-tradeoff story. That's one direction in which we're going to do some algorithmic analysis, and those simple cartoon lines tell us something interesting will happen.
OK. Unfortunately, I've swept under the rug a huge, huge factor on real systems, and the architects have probably been thinking all along: he's going to have to say something about static power, or idle energy. This is the energy you burn just by virtue of being on, during the time the computation executes: even if nothing were happening, there is still power being fed into the system and energy being burned. So this is a baseline static energy, or static power. We could throw that into the model too. One simple way is just to say there is some static power, whatever it is, and for the time that I'm running, I pay an additional energy cost proportional to that time. You can plug that in and work through it all again.
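Plugging the static term in is mechanical: a constant power P0 paid over the whole execution time. A sketch with made-up constants:

```python
def time_model(W, Q, t_f, t_m):
    """Roofline-style time: max of flop time and mop time."""
    return max(W * t_f, Q * t_m)

def energy_model(W, Q, e_f, e_m, P0, T):
    """Three-term energy: dynamic flop energy plus dynamic mop energy
    plus static power P0 paid over the execution time T."""
    return W * e_f + Q * e_m + P0 * T

# Hypothetical constants, for illustration only.
W, Q = 1e9, 1e8
t_f, t_m, e_f, e_m = 1e-9, 4e-9, 1e-9, 1e-8
T = time_model(W, Q, t_f, t_m)
E_no_static = energy_model(W, Q, e_f, e_m, 0.0, T)
E_static    = energy_model(W, Q, e_f, e_m, 50.0, T)  # 50 W idle power
```

Because the static term scales with time, anything that shortens the run also saves static energy, which is what connects this term to race-to-halt behavior.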
OK. So what Jee has been doing over the past few weeks is writing some microbenchmarks to test all of this. (I just got a notification that my battery is about to run out, after four and a half hours or so.) This is a microbenchmark that basically does a bunch of flops and a bunch of memory operations; that's all it does. It's a synthetic benchmark that allows us to artificially vary intensity. So let's see what happens. Here are two platforms, a GPU platform and a CPU platform; let's look at the GPU first. The solid lines are our model curves, including the static-power term, and the dots are the measurements. In the eyeball norm, these are a pretty good match.
[Question.] Right, and that's because for time we know the constants from the specs, while for energy we don't know the specs, so we fitted them. We fitted to this three-term model with the idle energy; it was just a regression fit, basically, so that's why it hugs the points. But at least it says the functional form fits. And once you do the fit, you have the energy constants themselves.
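The fit itself can be ordinary least squares on the three-term model E = W * e_f + Q * e_m + P0 * T. This self-contained sketch uses synthetic "measurements" in arbitrary units (all constants made up) and recovers them via the normal equations:

```python
def solve3(A, b):
    """Solve a 3x3 linear system by Gauss-Jordan elimination
    with partial pivoting."""
    M = [row[:] + [bi] for row, bi in zip(A, b)]
    for i in range(3):
        p = max(range(i, 3), key=lambda r: abs(M[r][i]))
        M[i], M[p] = M[p], M[i]
        for r in range(3):
            if r != i:
                k = M[r][i] / M[i][i]
                M[r] = [a - k * c for a, c in zip(M[r], M[i])]
    return [M[i][3] / M[i][i] for i in range(3)]

# Synthetic measurements: E = W*e_f + Q*e_m + P0*T with known constants
# (arbitrary units, invented for this example).
e_f, e_m, P0 = 2.0, 30.0, 25.0
runs = [(1.0, 0.1, 1.0), (0.5, 0.4, 2.0),
        (2.0, 0.1, 1.5), (1.0, 0.5, 3.0)]       # (W, Q, T) per run
y = [W * e_f + Q * e_m + P0 * T for W, Q, T in runs]

# Ordinary least squares via the normal equations: (X^T X) c = X^T y.
XtX = [[sum(r[i] * r[j] for r in runs) for j in range(3)] for i in range(3)]
Xty = [sum(r[i] * yi for r, yi in zip(runs, y)) for i in range(3)]
fit = solve3(XtX, Xty)   # recovers [e_f, e_m, P0]
```

With noisy real measurements you would use more runs and the same normal equations (or a library least-squares routine), but the idea is identical.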
Yeah. The right-hand side is a different benchmark, actually, that we got from a colleague sitting in the fourth row. And there's a question about tuning, and whether we can increase things; it may be that the vendor specs are too optimistic. We haven't looked at that in detail yet, but at least it follows the trend. (You're not supposed to look at that one yet, though.)
OK, so I'm showing the fits now. You'll notice that on the energy plot there are actually two vertical lines drawn: one is the actual balance point, the actual fifty-percent mark; the other is the balance point if static power were zero. So this says that the thing we've built should obey the trend we saw before, and the effect of static power is to push this balance point to the left. This is why we think people have observed that race to halt works: not because race to halt is inherently right, but because this is where the balance point sits today.
So the other interesting question to kick back to the architects is: if I drive static power to zero, well, first, how close can I get to doing that? If it is possible, then this sort of analysis of time-energy tradeoffs may become more interesting at the algorithmic level. If not, if this is just a fact of life and it will always be like this, then that's actually good news for me: it means I don't have to do anything differently than I've been doing in the past, and I just keep optimizing basically the same way I always have. From a research point of view, that's less interesting, so I'm kind of hoping that you guys will drive static power to zero. But if you don't, this is what will happen. Yeah?
[What does the benchmark do?] It's just doing multiplies and adds, some fully unrolled loop, getting rid of all the extra instruction-overhead junk. I don't know if you want to add anything else about what you had to do; it's pretty much that. Right, yeah. With this artificial microbenchmark it will be hard to see those kinds of effects, and the model is assuming some kind of perfect overlap. With a microbenchmark that's easy to arrange; with something real it will be harder. So yes, it's possible some of this will break down; that's one of the limitations of the analysis so far. But we're just sketching an idea: let's write down some models, let's play with them, and see where it goes, and what the implication is for future designs.
OK, other comments before I move on? Right. So, this I already said: there's a similar experiment on a CPU. The fit is not as good, so there may be some more work to do there still. But the trend is in fact reversed, in an even stronger way: the time balance point is farther to the right than the energy balance point. So that's what it is, yeah. That's right. That's right.
Yeah you know I yeah I mean the yes so
I basically I think you
said the right thing.
The platforms on this right-hand side are much beefier, and they do a lot more stuff. It's certainly not stuff that's exercised by the microbenchmark, but you have to pay those overheads anyway, so, you know, that causes the shift.
At this level we can't map to finer-grained features; I don't think there's enough stuff in the benchmark to do that. But that is one of the things that Jee's been talking about doing, you know: separately accounting for the energy of the caches at various levels, doing all this other kind of stuff. So one could certainly imagine doing that.
OK.
So, I don't know, the architects look bored; maybe they're like, this is hopeless.
The only thing I'll say about the power model is that it's just dividing the curves, dividing the energy curve by the time curve. The fit is not as good on the G.P.U. side; in particular, there is some power throttling happening.
So, based on the fitted constants, you know, you could burn two hundred eighty watts in the worst case, but there is some hardware throttling that forces that not to happen. So there's some cap, and that's the cause of this gap here. That's something not accounted for in the model, and certainly power throttling will only become more and more important as we go forward.
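Dividing the energy model by the time model gives an average-power model, and the throttling cap can be modeled as a simple clamp. Again, all constants here are illustrative stand-ins only, not the fitted values:

```python
def power_model(intensity, peak, bw, e_flop, e_byte, p_static, cap=None):
    """Average power = (energy per flop) / (time per flop), optionally
    clamped to a hardware throttling cap in watts."""
    t = max(1.0 / peak, 1.0 / (intensity * bw))     # time per flop
    e = e_flop + e_byte / intensity + p_static * t  # energy per flop
    p = e / t
    return min(p, cap) if cap is not None else p

# Illustrative constants only; the cap is a made-up throttle limit
p_uncapped = power_model(8.0, 100e9, 50e9, 100e-12, 150e-12, 10.0)
p_capped = power_model(8.0, 100e9, 50e9, 100e-12, 150e-12, 10.0, cap=20.0)
print(p_uncapped, p_capped)  # the clamp creates the gap vs. the raw model
```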
OK, so, all right. The main thing I want to leave you with from this first part, and since we're running low on time I'll do the second part in basically five minutes, is just this idea of, you know, writing down a first-principles model and really trying to map algorithmic and architectural features together. What I think we could all do together, if this program kind of works, is sort of the next part.
And this is Kent's thesis. So he's been thinking in sort of a different way about co-designing algorithms and architectures, and he has a model; you could think of it as something kind of like what we wrote down in the previous part, but with tons of other stuff in it. In particular, he wants to reason about power and die-area constraints and what implications those have for algorithms. So let me just talk very briefly about the thing he's been doing and show you what I think is one very interesting picture.
But just by way of motivation: the first part was sort of bashing the algorithms people, so now it's time to bash the architects. So, back when I started this job, or soon thereafter, I went to a meeting. It was in some beautiful setting in the mountains, and we were going on a hike, and here I was with somebody from Nvidia and someone from Los Alamos. The guy from Los Alamos was the chief architect of the Roadrunner system, which in two thousand and eight, the month before this meeting, had just become the number one machine on the Top five hundred list; it was built out of Cell processors. OK, so, sort of my
G.P.U. colleagues, basically my heterogeneous-computing colleagues. And we were hiking, and we were going down; we didn't have a map. We reached some point, and it was just kind of funny: there was a high-stakes direction, that's where the black diamonds were, and there was this other sign pointing toward the easier route, and you could see them looking longingly off in the high-stakes direction. So this is, in my view, how you guys like to think of yourselves. Here's what we actually did.
We took the easier route, and here's the guy, crazy guy, still looking off to the right with his brain, but clearly his body would be going in the other direction. So anyway, I said, of course, they'd say, well, we were with a bunch of algorithms and software people, so we had to take the easier route; you know, fair enough. OK, but in some ways this is, as an algorithms person, how I view architecture sometimes: you guys do stuff and you throw it over the fence, and of course what I tried to say in the beginning is maybe there's not enough communication.
What we really need at this point is some sort of map, and maybe a compass, that would tell us how to get to the bottom. So what Kent's been doing is building this map, and I'm going to show you the map he's built, or at least the first one, and it's pretty interesting. So here's the notional design problem he's been thinking about. He's like, OK, we're going to build, we'll start with one processor, and eventually we'll think about an entire supercomputer, and we impose some constraints on the system we're allowed to build. One is power: we can't use more than a certain amount of power. And the other is that every processor that's part of the system only has a certain number of transistors on it. So this is a resource allocation problem, or a constrained optimization problem.
What I can do with power is maybe I can give you more bandwidth, or maybe I can give you higher core clock frequencies. And what I can do with transistors is I can give you more cache, or I can give you more cores. These are fixed constraints; I have to allocate them in some way between one or the other. And the notion is that this means there is a space of all possible machines, OK? Some will have many cores that are slow, and some will have a few very, very fast cores and lots of cache, say in the lower-right corner.
And given an algorithm, what I can imagine doing is running this thing on all of these machines and finding the best one; maybe the best one is inside the bull's-eye, OK? And think of these contours as sort of contours of iso-performance. All right, so imagine that we take the models we saw in the previous part, and we take the algorithm analysis that our partners did, the sort I led off with, and then we enrich them also to reason in some way about area, and we put this all together. So we have some model of time, and we have models of power and area, and we solve this constrained optimization problem.
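Formalized, this becomes a small search problem: sweep the power split (bandwidth vs. clocks) and the transistor split (cores vs. cache) under fixed budgets, score each candidate machine with a time model, and keep the best. The sketch below is a toy illustration only; the budgets, conversion factors, and roofline model are all made up, not Kent's actual model:

```python
def design_map(power_budget=100.0, transistor_budget=1e9,
               flops=1e12, words=1e10, steps=20):
    """Brute-force the power and transistor allocation that minimizes
    a toy roofline runtime for a workload doing `flops` flops and
    moving `words` words. All conversion factors are made up."""
    best = None
    for i in range(1, steps):
        for j in range(1, steps):
            p_bw = power_budget * i / steps          # watts for bandwidth
            p_clk = power_budget - p_bw              # watts for clocks
            t_cores = transistor_budget * j / steps  # transistors for cores
            t_cache = transistor_budget - t_cores    # transistors for cache

            bandwidth = 1e9 * p_bw  # words/s per bandwidth watt (made up)
            cores = t_cores / 1e7   # made-up transistors per core
            # Dynamic power ~ f^3, so frequency ~ (power per core)^(1/3)
            freq = 1e9 * (p_clk / cores) ** (1.0 / 3.0)
            peak = cores * freq     # flops/s
            # Bigger cache -> less memory traffic (toy model)
            traffic = words / (1.0 + t_cache / 1e8)

            t = max(flops / peak, traffic / bandwidth)
            if best is None or t < best[0]:
                best = (t, p_bw, t_cores)
    return best

# Compute-heavy (matrix-multiply-like) vs. communication-heavy
# (F.F.T.-like) workloads land in different corners of the map:
mm = design_map(flops=1e12, words=1e10)
fft = design_map(flops=1e10, words=1e12)
print("matmul-like best (time, bandwidth watts, core transistors):", mm)
print("fft-like    best (time, bandwidth watts, core transistors):", fft)
```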
OK, so you imagine formalizing this; the research is all about the models: what do the models look like, what terms do they have to account for?
And here are two of the maps that he's built so far; we're just starting this, OK? So one map is for the algorithm matrix multiply, and the other map is for a three-D. F.F.T. The main difference algorithmically is that matrix multiply is very compute-intensive and does relatively little communication, and the three-D. F.F.T. is just the opposite.
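That contrast can be made quantitative with the standard communication bounds: matrix multiply with a fast memory of Z words moves roughly n^3/sqrt(Z) words for its ~2n^3 flops, while an n-point F.F.T. moves roughly n*log2(n)/log2(Z) words for its ~5n*log2(n) flops. A small worked example; the constant factors are the usual textbook conventions, and the sizes are made up:

```python
import math

def matmul_intensity(n, Z):
    """Flops per word moved for n-by-n matrix multiply with fast
    memory of Z words (Hong-Kung-style bound: ~n^3/sqrt(Z) words)."""
    flops = 2.0 * n**3
    words = n**3 / math.sqrt(Z)
    return flops / words  # = 2*sqrt(Z): grows with cache size

def fft_intensity(n, Z):
    """Flops per word moved for an n-point F.F.T.
    (~n*log2(n)/log2(Z) words moved)."""
    flops = 5.0 * n * math.log2(n)
    words = n * math.log2(n) / math.log2(Z)
    return flops / words  # = 5*log2(Z): grows only logarithmically

Z = 2**20  # a one-mega-word fast memory, say
print(matmul_intensity(4096, Z), fft_intensity(2**24, Z))
```

So growing the cache pays off dramatically for matrix multiply but only logarithmically for the F.F.T., which is why the two maps pull in different directions.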
OK. So the optimal point is the dot labeled one in the two plots; that's the fastest machine in this space for the algorithm, OK? So here's a funny thing: if you take today's C.P.U. and G.P.U. processors and you extrapolate them according to the trends that have sort of held, in terms of how cache size grows and how performance and bandwidth have changed over time for the last forty years, and you extrapolate to twenty eighteen, taking into account, you know, transistor shrink and all the other stuff, these are the two points you get.
So we're basically building matrix multiply machines; this is what this says. OK, and in the meantime there's all this other space out here. What is that stuff? Are any of those worth building? I don't know. Are we even trying? So, sorry, I heard a yes. So is one of you building one of these out in that space? Yeah.
Right, right. So, but which machine are you building? OK, that one. The one labeled one.
OK, so this is what I call evolution, right? And now that we have the map, I can sort of come to you and say, you know, this is the map, this is where you're going; is this really where we want to be? And the answer is, I don't know; it depends on the workload. Is my workload more like a matrix multiply or an F.F.T.? Who knows.
OK, so there is one notional design that's been put forward by people who are building exascale, these exaflop machines. Nvidia has proposed, they've written this paper, the same paper I referenced earlier, that sketches out their notional design for a processor they call Echelon, and we estimate, based on this paper, that Echelon sits there in this design space.
And I think that's encouraging: it's sort of leaping out of this region where we've been stuck for a very long time.
Is it really revolution? Again, I don't know; it depends on the workload, it depends on a lot of things. What I can say about it is that it's better than the notional projected C.P.U. system by a lot on both problems, but, interestingly enough, it gives up on performance for matrix multiply. So in order to move somewhere else in this space there's going to be a tradeoff; we're going to give something up. In this case, you know, should we give up matrix multiply? So this conversation is a little bit more relevant in the H.P.C. world, where everybody complains about tuning for matrix multiply, and this sort of makes it more precise. So the research question, I think, is: what about all this junk?
Let me make one quick observation; this is what I think is the most interesting feature of these two particular maps: what they have in common. So the thing they have in common is, suppose I built this, OK? Notice where it sits on the X. axis. All right, I'm going to draw the same line on the X. axis on the other side, and you'll see it actually cuts through the region of interest for F.F.T.s, OK? So without changing my X.-axis point, if somehow I could magically reconfigure the Y. axis, then I could actually do both of these in sort of near-optimal time, given the constraints on power and area.
OK, but what it requires is some kind of extreme power reconfigurability. So we talk about speed stepping and tweaking clock frequencies; you know, that changes power by a little bit, right? This is saying, you know, can you change it by eight X.? So is this feasible? Again, I don't know.
But I think this is another interesting problem for the architects to think about: of all of these designs, which ones are feasible, and could you build a system with extreme power reconfigurability? And there are people who are thinking about this.
Right, right, right. So, yes, I think three-D. stacking basically is what enables getting into this range. The interesting thing here is, can you build a processor where you reconfigure the power and the speed, you know, dynamically? Right: I'm running matrix multiply, then I stop and switch to an F.F.T. I want to change the bandwidth; I don't just want to have the maximum, I actually want to change it.
So take all the power that was for processing, you know, shut it down, bring it to memory bandwidth. I think people are interested in this reconfigurability idea; this says that you need to do it in a big way, like there's almost an order of magnitude change here. So is that possible? I don't know.
OK.
So then, let's see, since we're basically at the end, let me just end on one picture, which is what happens when you take this and you think about building an entire supercomputer. So again you have power and area constraints, and you're allocating these resources in some way.
You can build some systems, and the thing is, I'm going to take this back to the algorithms people. You know, one kind of interesting analysis Kent has done is he asked: how do algorithms power-scale? OK, so I have some algorithm, and let's say I give you a machine and you tune it. Each line represents a different algorithm, and this is how fast it runs relative to a baseline that uses some amount of power.
So every point on this curve, like this green point for example, says: I build a fifty-megawatt machine, I tune it for the F.F.T. by running this model, doing the map thing, finding the minimum, and then I build it. How much faster does it go compared to the twenty-megawatt machine?
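This power-scaling experiment can be sketched in the same spirit as the toy design map: for each power budget, re-optimize the machine and record the speedup over a baseline budget. Everything below, the model, the conversion factors, the budgets, is an illustrative stand-in, not Kent's actual analysis:

```python
def best_time(power_budget, flops, words, steps=20):
    """Minimum toy-roofline runtime over all power splits
    (bandwidth vs. clocks) for a fixed power budget."""
    best = None
    for i in range(1, steps):
        p_bw = power_budget * i / steps  # watts for bandwidth
        p_clk = power_budget - p_bw      # watts for clocks
        peak = 1e9 * p_clk        # flops/s per clock watt (made up)
        bandwidth = 1e8 * p_bw    # words/s per bandwidth watt (made up)
        t = max(flops / peak, words / bandwidth)
        if best is None or t < best:
            best = t
    return best

baseline = 20.0  # baseline budget (think: twenty megawatts, scaled down)
for p in (20.0, 50.0, 100.0):
    speedup = best_time(baseline, 1e12, 1e10) / best_time(p, 1e12, 1e10)
    # In this toy model the speedup grows with the budget; different
    # flops/words mixes would trace out different curves (classes).
    print(p, speedup)
```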
And this sort of separates these algorithms into three classes, based on how well they power-scale in this sort of idealized setting. So, those of you who are thinking about algorithms: what algorithm should I use? Well, if power is the constraint, you want to use the one that power-scales the best, maybe.
So maybe that will shift you in one direction versus the other. So for my C.S.E. colleagues, you know, people talk about, well, should we do less implicit stuff and go back to doing explicit stuff? This says maybe you might get a big payoff there, but remember, Jee's analysis, the flop-communication tradeoff, says there's a limit, you know, based on balance. So, anyway, there's some kind of story here that I think is emerging, and I think it has directions for both algorithms and architecture people. OK, so with that I will stop, and I'll take questions if you have them.
Thank you.
The word communication... Yeah, yeah, there are a bunch. The simplest class: you've heard of stencil computations? That's one good example, but there are many others.
For people interested in solvers, and I see there are some solver people here: you know, preconditioners are one place, when we solve big linear systems, where you can play this game. You can have a preconditioner that's very parallel and doesn't do a lot of communication, but those tend to also be crappier in some ways; you have a slower rate of convergence. So there's a tradeoff there.
Not quite; so there are lots of examples.
You know.
There is some. So I guess, depending on which direction you're going in, you know, there was a cliff in caches, and basically there's probably some working-set phenomenon: once you have the working set, there's no reason to have more cache. And there's also kind of a concurrency cliff, because cache and concurrency are trading off, so you might expect there's some critical point that you don't want to go too far beyond.
And, say, in the power direction, you know, there's something you need: you need to balance communication with the rate of processing. So then again you'd expect essentially there's some kind of balance point that defines the cliff in the other direction; in my plots, that was the Y. axis.
You know.
Yeah.
That's the other: time and energy. Yeah, maybe other kinds of time.
Yeah, I would just do auto-tuning, and that will fix your problems. Well, no, I hadn't thought about that, but, I mean, we're academics, so I can talk about anything.
We're out of time, but other comments, or reactions from the architecture people? You know, like: this is junk, everybody knows there's a time-energy tradeoff, nothing new here.
What's that? OK, that's something, you know, I shouldn't say in public. OK, well, thank you very much.