Thanks for coming, everyone. I see a lot of the computer architecture students here, and that's good: part of my goal here is both to get you interested in the sorts of things we care about, and also to reprimand you for doing a crappy job building machines. I'll try to do both, but I'll start nice. So, for those who don't know me, our lab has been working on making codes really, really fast. Of course, it seems people don't care about speed anymore; what they actually care about is energy and power. Just as an example: I unplugged my phone at 10 a.m. this morning, it's 2 p.m. now, and the battery is down to about a third of its capacity. That really irritates me; that's a big problem. But it's not just on these devices. The systems I care about are supercomputers, and the system we're supposed to build in 2020, if we build it out of the components we have today, is projected to cost somewhere on the order of four or five hundred million dollars a year just to turn on. At that cost you haven't even done anything useful: you haven't written any software, you haven't hired any staff, you've just installed the machine; you probably haven't even cooled it. So power and energy are a huge, huge problem, and what we're wondering is: do power and energy have any implication for those of us who care about writing software and designing algorithms? I don't know the answer, and that's why I'm here; my students and I are asking for your help, your input, and your feedback. Since it's mostly students in the audience, I want to start with the 1985 ACM Doctoral Dissertation Award winner. This was Danny Hillis. How many of you have read this thesis?
You know, those of you who are in computer architecture and haven't read this thesis, especially if you're interested in parallel computing: you should be ashamed of yourselves. This is a fantastic thesis to read, and here is the reason why. The concluding chapter of the thesis is titled "New Computer Architectures and Their Relationship to Physics, or Why Computer Science Is No Good." Danny Hillis was a computer architect; he designed the Connection Machine, one of the early really major parallel computer architectures, and in that chapter he was reprimanding the algorithms people. He was saying: you guys don't know what the heck you're doing; you're designing algorithms with completely unrealistic cost models. In particular he was reacting to a model called the PRAM. How many of you know what the PRAM is? OK. He was saying: you people designing PRAM algorithms are crazy, because at the end of the day you have to take the algorithm and run it on a physical machine, and the PRAM ignores all costs, in particular communication costs; but you have to pay the cost of communication. So if you ignore it, not only are you going to get an algorithm with crappy performance, you're also not helping the architects. As an architect (well, I'm not an architect, you guys are the architects), you want to help us build software that's faster and more efficient in time and energy, but you can't do that if everything I run is not really designed for the machines you can build. So this was a call to the algorithmists to really rethink the way they thought about algorithms.
This is a most inspiring read, and I think you should definitely go read it. When you're sitting down to write your own theses, think about being this provocative; mine, by comparison, was extremely boring, and I wish I had thought to be as provocative. So maybe some of you will take some inspiration from this. OK. In my lab, the three people who are going to write theses and finish soon are Aparna, Jee, and Kent. What I'm going to do in this talk is tell you something about our collective thinking about time and energy, from the perspective of the projects they've been working on. Except for Aparna's, these are all relatively early-stage things, so there's an opportunity for some feedback and interaction. I'll start with Aparna; she's the most senior. She had a chance to work on a team led by George Biros which won the 2010 Gordon Bell Prize. For those of you who don't know what that is, it's sort of the highest performance-demonstration award in the area of high-performance computing; this one was for something that simulated the flow of red blood cells. Aparna worked on multicore scaling: there was an algorithm that scaled to the entire system pretty well, but single-node performance was terrible, and she came in and fixed that. There was a lot of performance engineering involved. Then, after this performance-engineering exercise, she stepped back and said: whoa, so what did we learn from all this? Aparna is sitting in the back; wave your hands so we can see you.
So she stepped back and said: did we actually learn anything? And she started thinking very hard, in an analytical way, about the fundamentals of the algorithm: how many operations does it do, how much concurrency is there, how much communication does it do? She did this analysis for a particular algorithm called the fast multipole method (FMM). This is an optimal, linear-time, guaranteed-approximation algorithm for N-body problems. And she analyzed it for a cartoon manycore architecture. This cartoon architecture has some number of cores, a large shared local memory (for the architects, think last-level cache), and then some slow DRAM. She looked at the algorithm and counted how many operations it does, how much parallelism there is, and how much data transfers back and forth between, say, main memory and this local fast memory where you do the work. She wrote down some horribly complicated expression, and because I'm too stupid to understand long and complicated things, I boiled it down to its essence. Essentially, what she showed is that execution time is basically this little formula, so let me walk you through it, because there's something very interesting and, I think, very deep in it. The first factor is the "speed of light": if you didn't have to do any communication whatsoever, this is how fast the algorithm would run. And you can see the form of the formula is speed of light times one plus some other factor, and that other factor is the communication penalty. In this case it turns out to be basically just the ratio of the processor's peak flop rate (floating-point operations per second) to the peak bandwidth of the communication channel.

You'll notice the size of this total fast memory, Z, does not appear: it's a low-order term, and in simplifying I just dropped it. That's a hint that, for the FMM, increasing cache size doesn't actually help us; it's essentially irrelevant, given that caches today are likely already large enough. OK. Now, the penalty is very interesting. It's the balance ratio of the processor, the flop-to-byte ratio of the processor, and I'm sure the architecture people are very familiar with this concept of a balanced processor. The troubling thing about this ratio is that it doubles about every four years. So this communication penalty is getting larger and larger and larger: even though the speed of light gets faster, because the processors get faster, the communication penalty grows with time. Today it turns out not to matter: when we did this Gordon Bell run, it scaled to the full system and it was very, very fast, so it seems fine. But at some point we're going to be in trouble if this trend continues. So one question for the architects is: is there anything we can do about this trend, to change it in some fundamental way? There are lots of ideas, things like stacked memory and so on, but do they really address the problem; do they make it possible for this ratio to essentially stay constant over time? OK, so that's the problem on the processor-balance side. On the algorithm side, is there anything we can do? It turns out there is, and what you can do to compensate for this growing balance ratio is kind of interesting: you drop accuracy. Basically, according to this expression, the main knob inside this red factor for compensating for growing imbalance over time is to drop bits, as many bits as you can.
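To make the shape of that formula concrete, here is a minimal numerical sketch of the simplified model as described: execution time is a speed-of-light term times one plus a communication penalty, where the penalty is the machine balance divided by the algorithm's intensity. All the constants below are hypothetical, purely for illustration.

```python
# Sketch of the simplified execution-time model:
#   T = T_light * (1 + penalty),  penalty = machine_balance / intensity
# where machine_balance = peak flop rate / peak bandwidth (flops per byte)
# and intensity is the algorithm's flop-to-byte ratio. Hypothetical numbers.

def modeled_time(t_light, peak_flops, peak_bw, intensity):
    """t_light: runtime with free communication (seconds);
    peak_flops: flop/s; peak_bw: bytes/s; intensity: flops per byte."""
    balance = peak_flops / peak_bw        # machine flop:byte ratio
    penalty = balance / intensity         # communication penalty
    return t_light * (1.0 + penalty)

# The troubling trend: balance doubles roughly every four years, so the
# penalty grows even as the speed-of-light term improves.
today = modeled_time(1.0, 1e12, 2e11, 10.0)   # balance = 5 flops/byte
later = modeled_time(1.0, 1e12, 5e10, 10.0)   # ~8 years out: balance = 20
assert later > today                          # the penalty grew 4x
```

The point of the sketch is just the trend: holding the algorithm's intensity fixed, a doubling balance ratio makes the penalty term dominate over time.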
So it's sort of time for us as algorithmists to stop and ask: how many bits do I really, absolutely need, and is there anything I can do to get rid of the rest? In some cases the answer will be nothing; in other cases, say an astrophysics simulation where you'll probably be happy with one digit of accuracy, you can basically drop all the bits. OK, so that's the prognosis, and this is our first cartoon, our first hint, that if we go back to algorithmic first principles we might be able to tie them directly to architectural parameters; and then those of us who do algorithms and those of us who do architecture have something interesting to talk about. That's the goal. Let me leave you with one other little analysis that Aparna has done, which asks: how much time is spent doing flops versus communication? It turns out that today almost all the time is spent doing flops, and that's good, it means the method will scale. But sometime between 2015 and 2020 it will stop scaling; basically, it will be limited by communication. People who have been studying the FMM for a long time were very surprised to hear this; they had just sort of assumed the FMM would be scalable for a long, long time to come. But the window is closing, unless you guys do something about it. OK, so that's that. Any questions at this point? [Audience question] That's right. That's right. [Audience question] Probably not; it depends. The imbalance means the crossover will grow over time, but maybe it doesn't grow too fast. And if the data is really huge, then no: I don't expect an n-squared algorithm to beat an n-log-n one, no matter what the communication does. That's right.
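The scaling-window claim (that the FMM stops scaling sometime between 2015 and 2020) follows from the same doubling trend. Here is a hedged back-of-the-envelope version; the doubling period and the numbers are assumptions for illustration, not measurements.

```python
import math

# If machine balance (flops per byte) doubles every `doubling_years`, the
# time until communication overtakes computation depends only on how far
# the algorithm's intensity sits above today's balance. Purely a sketch.

def years_until_comm_bound(intensity, balance_now, doubling_years=4.0):
    """Years until machine balance exceeds the algorithm's intensity."""
    if balance_now >= intensity:
        return 0.0                     # already communication-bound
    return doubling_years * math.log2(intensity / balance_now)

# An algorithm with intensity 16 flops/byte on a machine of balance 4
# has about eight years before communication starts to dominate:
assert years_until_comm_bound(16.0, 4.0) == 8.0
```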
That's right, that's exactly right. In fact, that's one of the fundamental principles of parallel computing: you try to come up with parallel algorithms that are work-optimal, so they don't do asymptotically more work than the best sequential algorithm, and that's exactly our starting point here. So you can't beat linear time on this guy; well, you can if you throw away data, I guess. OK, good. Other questions? [Audience question] Right. So that analysis is characterized by the ratio of, essentially, the time to do flops and the time to communicate. As you go to a processor with lower performance, that ratio will tend to get smaller, so the communication penalty will tend to decrease, and that means you'll tend to shift this crossover point further into the future. And that's actually a good thing, if you need more than one processor. If you only need one processor, it's a terrible thing, because then all you care about is how fast it is; so there's some negotiation there. But yes, if we're thinking about designing a supercomputer out of low-power parts, it would still be nice if they were high performance; there will be sequential parts of code and so on and so forth. But balance is the key, according to this analysis. OK, other questions? These are good. [Audience question] Until you can write down the communication costs, it's impossible to say. [Audience question] Yeah, and that would be great. Actually, thank you for reminding me: there's something very important on this previous slide, which is that this accuracy parameter enters superlinearly. So even if there's a linear increase in imbalance, a relatively small
decrease in accuracy can compensate for it, because of this cube: the accuracy term enters with a power of three. And in some cases, say machine learning, or astrophysics where maybe you only need one digit, you can drop lots of bits, and there the wins are big. So yes. OK. All of this is just by way of background motivation, but the title promises to say something about the relationship between time and energy, so that's the interesting new thing; let's see if we can say anything concrete about it. So now I'm going to switch. Aparna is about to graduate, so we'll stop talking about her work (she can do that on her own) and start talking about the next person, Jee. What Jee has been thinking about is an energy analogue of the roofline model for performance analysis. How many of you know what the roofline model is? If you don't, I'll basically walk you through it. Nobody knows it? That's too bad. All right, so let's go back to the cartoon: I have a processor, a small fast local memory, and a large slow memory. Let's suppose I can write down an algorithm and count how many local operations it does and how many words of communication it moves; we'll call these flops, on the processor, and mops, memory operations, in between. All right, so what's the execution time? We need some costs, so let's suppose we know the cost of doing a flop and the cost of doing a mop.
Then we can calculate the total flop time and the total mop time, and if everything is happening concurrently, we basically pay the larger of the two: if we ideally schedule and overlap everything, the total time is just the max of the component times. OK, so far nothing controversial. Good. Now let's do a little algebra. As in Aparna's analysis, I'm going to pull out the speed-of-light factor, and again you can see there's a performance-penalty factor. This factor is very interesting; it's made of two ratios. The first ratio, the blue part, is Q over W. That's one over what we'll call intensity: the inherent flop-to-byte ratio of the algorithm. This is a property of the algorithm alone. The red factor is a property of the machine, and it's again that balance ratio we saw in Aparna's analysis. So it's the ratio of these two ratios that tells us the magnitude of the penalty: if it's greater than one, we'll be limited by communication; otherwise we'll be dominated by flops, and that's what we want. So this is the balance ratio. All right, that's the story for time. So what's different about energy? By energy here I really mean joules: joules per operation, or joules per byte. What's the big difference? Right: the sum, not the max. I can't overlap energy; I have to pay for all of it. It seems like a small difference, but it has big consequences. So here I've written down the sum, and I've factored it the same way; it has the same functional form, just a sum instead of a max. OK, so now I'm going to plot these, and I'm just
going to do something a little bit funny: I'm going to plot the inverses, so performance (one over time) and one over energy, normalized by the speed of light and by the maximum energy efficiency, respectively. When I do that, I get two curves. The top line is the performance line, the one corresponding to time, and it looks like a roofline; this is what some people call the roofline model, due to Williams, Waterman, and Patterson. The blue line is the analogous line for energy, and of course it's smooth, because we're summing: instead of that sharp inflection, you get a smooth curve. To get some actual curves, we went and read a paper from the people at NVIDIA, pulled out their flop-to-byte ratio in time and their flop-to-byte ratio in energy, and used those to plot the curves; these are the curves you get. The balance point with respect to time is 3.6 flops per byte; the balance point with respect to energy is 14.4 flops per byte, so about a factor of four bigger.

OK, so what does this mean; does it say anything interesting? For the algorithms people in the audience: you write down your algorithm, you analyze it, and that puts you at some point on the x-axis. If you want to know how fast you could run if you tuned everything and wrote the optimal code, you draw a straight line up, for a fixed algorithm. So an intensity of sixteen, say, would put you somewhere on the red line and somewhere on the blue line, at the same x-value. So what does this tell us? It tells us a bunch of really interesting things, in particular about this region in the middle, the region between B_T and B_E. What the heck is this? This is kind of a funny zone. The y-axis is log base two, and you can see where the one-half is: half means you're within half of the best possible, so either I'm at half the best performance or I'm at half the best energy efficiency. I think of these as the crossover points: on this side, communication dominates time and energy, and on that side, flops dominate time and energy, and you want to be to the right if at all possible. But this zone in the middle is funny. It says that if I'm somewhere in here, I could be running as fast as possible (I'd be at the top of the red line, at one) but still be under the fifty-percent energy-efficiency mark. So there are two notions of compute- and memory-boundedness: in this zone I can be compute-bound in time but memory-bound in energy, and this is where I should expect to see some funny behavior; I'll see programs that are very, very fast but use a lot of energy, maybe more than we would like, essentially because of communication. So it's a funny little zone: if there's a gap, and B_E is bigger than B_T, then we might be compute-bound in time but memory-bound in energy. Notice we're abstracting the algorithm only by its intensity on the x-axis, and for a particular architecture we draw these two lines. Is this clear so far? OK. So one thing is this funny gap. The other thing is that the fact that B_E is larger than B_T says it's harder to optimize for energy than for time. Optimizing means: I have a code at some point on this axis; when I measure its actual performance I'll be somewhere below the curve, and if I performance-tune it I'll hopefully hit the curve. When I change the algorithm, I'm going to try to increase the intensity without increasing the work. I can do that for time and get to a certain point where I'm time-optimal, but then I have to keep working in order to be energy-optimal.
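The two curves just described follow directly from the max-versus-sum distinction. Here is a small sketch using the 3.6 and 14.4 flops-per-byte balance points quoted above; the normalized-efficiency formulas are a simplification of the model, not measured behavior.

```python
# The two "arch lines": normalized performance (time) has a sharp knee at
# B_T because time is a max; normalized energy efficiency is smooth
# because energy is a sum. B_T and B_E are the GPU values from the talk.
B_T, B_E = 3.6, 14.4   # flops per byte

def norm_perf(I):
    """1/T normalized to the speed of light: min(1, I/B_T)."""
    return min(1.0, I / B_T)

def norm_energy_eff(I):
    """1/E normalized to peak energy efficiency: 1/(1 + B_E/I)."""
    return 1.0 / (1.0 + B_E / I)

# The funny middle zone, B_T <= I < B_E: full speed, yet under the 50%
# energy-efficiency mark -- compute-bound in time, memory-bound in energy.
I = 8.0
assert norm_perf(I) == 1.0 and norm_energy_eff(I) < 0.5
# At I = B_E the energy curve finally reaches one half:
assert abs(norm_energy_eff(B_E) - 0.5) < 1e-12
```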
OK, so if it's true that on real systems B_E is greater than B_T, then this says that optimizing for energy is harder than optimizing for time. [Audience question] Not yet, not yet. Well, I mean, we have guidelines for increasing intensity: basically it's reducing communication, that's basically it; and maybe reducing accuracy, as we saw in Aparna's example, would be another way. So far this doesn't tell us what to do; it just tells us the direction we need to go in. OK. Let's see. Here's the other thing it says that I think is very interesting. Suppose you're able to hit one of these lines, either the roofline or, for energy, what I'll call an archline. If I have an algorithm, and I manage to get it on the right side of the energy line and hit that curve, then it's very likely that I'm also time-efficient, because B_E is greater than B_T; it's the corollary to what we just saw on the previous slide. OK, so: how many of you have heard of the energy-optimization strategy called race to halt? So what is race to halt? Right: the theory behind race-to-halt is that to save energy you just run as fast as possible, then shut everything off, and when everything is off you're not burning any energy. This analysis says that may not be enough. If race-to-halt works, then maybe it's because these two balance points are effectively confluent; that's the sort of corollary. Why might it not be enough? Well, because I could be in this funny zone. Yes. OK.
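Here is a tiny numerical check of that corollary, using the simplified normalized curves with the GPU balance points quoted earlier; the formulas are a sketch of the model, not measured behavior.

```python
# With B_E > B_T, hitting the energy line's 50% mark implies you are
# already compute-bound in time -- but not the other way around, which is
# why race-to-halt alone may not be enough. Simplified normalizations.
B_T, B_E = 3.6, 14.4   # flops per byte (GPU values from the talk)

def perf_fraction(I):      # fraction of peak performance (roofline, a max)
    return min(1.0, I / B_T)

def energy_fraction(I):    # fraction of peak energy efficiency (a sum)
    return I / (I + B_E)

# Energy-efficient (>= 50%) implies running at full speed:
for I in (14.4, 20.0, 50.0):
    if energy_fraction(I) >= 0.5:
        assert perf_fraction(I) == 1.0
# ...but full speed does not imply energy-efficient (the funny zone):
assert perf_fraction(4.0) == 1.0 and energy_fraction(4.0) < 0.5
```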
OK. So right now this analysis is still very simple and very abstract, but already I think we're starting to see the makings of some interesting parameterizations. For the architects, I think an interesting question is: these B's, what the heck are these values, and what are the trends in them; how do they change over time? We saw how B_T changes over time; I said it doubles every four years. What about B_E? Are these going to meet, or are they going to cross over? If they switch, then I as an algorithm designer need to focus on optimizing time, and I'll get energy for free; but if they stay like this and the gap grows, then I need to stop optimizing for time and in fact optimize for energy. So which should we do? I don't know. Architects, what should I do? Will B_E track B_T or not? Any thoughts, any guesses? It's a democracy; should we vote? OK. [Audience question] Sort of; we'll get to that. Right now we're just talking about a model, and it's the cartoon model, where we take the time model and just translate it to an energy model; I'm assuming these constants are real and can be accurately estimated in some way. With respect to this model, we haven't done that yet, but I will show you some micro-benchmarking data in a moment to help convince you that this is real. OK, let's see. All right, the other thing I can do is ask about power. This is about time and energy; power is energy divided by time, so I can basically just divide the two curves, and you can ask what that gives you. It gives you another curve, which I'll call the power line, because it looks like power lines. And again, the balance points will be critical points somewhere in this curve that say something about the behavior. In this case, where B_E is greater than B_T, it says that basically
if you're communication-bound, you're also going to need a lot of energy, and you're going to have to increase power in order to go faster; but once you become compute-bound in time, you start reducing the overall power consumption. So if I want to reduce both power and energy, then as an algorithms person I need to move this way, to the right. OK, so that's power; let's defer additional discussion for the moment.

All right. So far the algorithm is still abstract; we want to say something maybe a little bit more concrete. Here's just an example of something we've been thinking about doing an analysis for, and I don't know where it will go yet. One class of interesting computations are those that exhibit a work-communication tradeoff. So maybe I have a baseline algorithm, algorithm one, and I can give you a new one, where the tradeoff is that the new one has lower communication at the cost of higher flops: multiply the flops W by a factor f, with f greater than one, and divide the communication Q by a factor m, with m greater than one. So if I have a work-communication tradeoff, what happens? We can ask: do I get a speedup? Or do I get an improvement in energy efficiency, which I'll call a "greenup"? I think we made up that term; I haven't seen anyone else use it. And we can do some analysis about this. I'm not going to show you the details, but for example: suppose all you care about is getting a greenup; what is the condition on f and m that guarantees you'll get one? If you just calculate it (it would take you about three seconds' worth of scribbling yourselves), you get: f less than one plus something or other. And this expression has some structure. For example, it says that if I reduce communication completely, so m goes to infinity, I can do more flops, but I can't do too many more: the factor f will be bounded by one plus B_E over I, where I is whatever my baseline intensity was. OK, and if you just stare at that, you can probably see it makes sense. All right.

[Audience question] Yes, there are many examples where this is possible. For the FMM, for example, which Aparna has been working on, there are ways to change the tree structure to reduce communication and increase computation; for stencils you can do this too. You move in discrete steps, but yes, there are many classes of algorithms where this is possible. OK. So for example, it says that if my baseline algorithm is right at the inflection point of the roofline, and I plug into the formula, I can't do more than about one plus this ratio, roughly five times as many flops. This is just the analysis; I don't know where we're going with this yet. I guess we should test it, probably. But you can imagine playing these games. There's more here, but I'm going to skip this part: you can ask when do I get a speedup, when do I get a greenup, when do I get both, when do I get neither, and sort of find zones. These points are not measured data; we're just plugging in different values of f and m. You can derive bounds on all these things, and so on. OK, so that's the work-communication tradeoff story; that's one direction where we're going to do some algorithmic analysis, and these simple cartoon lines tell us something interesting will happen. Now, unfortunately, I've swept under the rug a huge, huge factor on real systems, and the architects have probably been thinking all along: when is he going to say something about static power, idle power, idle energy? This is the energy you burn just by the fact of being on, during the time
the computation executes: even if nothing is happening, there is still power being fed into the system and energy being burned. So this is a baseline static energy, or static power, and we can throw it into the model too. One simple way is just to say there is some static power, whatever it is, and for the time that I'm running I pay an additional energy cost; you can plug that in and kind of work through it all again. OK. So what Jee has been doing over the past few weeks is writing some micro-benchmarks to test all this. (I just got a notification that my battery is about to run out, after four and a half hours or so.) This is a micro-benchmark that basically does a bunch of flops and a bunch of memory operations, that's all it does; a synthetic benchmark that allows us to artificially vary the intensity. OK, so let's see what happens. Here are two platforms, a GPU platform and a CPU platform; let's look at the GPU first. The solid lines are our model curves, including the static-power source term, and the dots are basically the measurements. And in the eyeball norm, these are a pretty good match. [Audience question] Right, and that's because, for time, we know the time constants from the specs; for energy we don't know the specs, so we fitted them, to this three-term model with the idle energy. It was just a regression fit, basically; that's why it looks like it hugs the data. But at least it says the functional form fits these points, and once you do the fit, you have the energy constants themselves. Now, the right-hand side is actually a different benchmark, which we got from a colleague sitting in the fourth row, and there's a question there about tuning, and whether we can do better; it may be that the vendor specs are too optimistic. We haven't looked at that in detail yet.
But at least it follows the trend. (That one you're not supposed to look at yet.) OK, so I'm showing the fits now. You'll notice that on the energy plot there are actually two vertical lines drawn: one is the actual balance point, the actual fifty-percent mark; the other is the balance point if static power were zero. So this says that the thing we've built should obey the trend we saw before, except that static power actually pushes the balance point to the left. And this is why we think people have observed race-to-halt working: not because race-to-halt is somehow fundamental (well, it is real), but because of where this balance point sits; this is why the race-to-halt strategy works today. So the other interesting question to kick back to the architects is: if you drive static power toward zero, first, how close can you actually get to doing that? If it is possible, then this sort of analysis with respect to time-energy tradeoffs becomes more interesting at the algorithmic level. If not, if static power is just a fact of life and it will always be like this, then that's actually good news for me: it means I don't have to do anything differently than I've been doing in the past; I just keep optimizing basically the same way I always have. From a research point of view that's less interesting, so I'm kind of hoping you guys will drive static power to zero; but if you don't, this is what will happen. [Audience question] It's just doing multiplies and adds, some fully unrolled loop, getting rid of all the extra instruction-overhead junk. Jee, I don't know if you want to add anything about what it does; it's pretty much that. [Audience comment] Right, right, yeah; I basically think you said the right thing. With this artificial micro-benchmark it will be hard to see those kinds of effects, and the model is assuming some kind of perfect overlap.
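The static-power effect just described can be added to the cartoon model as a third, constant-power term. This sketch, with entirely hypothetical constants, shows how a nonzero idle power pulls the energy balance point to the left, which is one way to see why race-to-halt works on today's machines.

```python
# Cartoon model with a static-power term:
#   E = W*e_f + Q*e_m + P_static * T,   T = max(W*t_f, Q*t_m)
# Per flop, at intensity I (flops/byte):
#   e(I) = e_f + e_m/I + P_static * max(t_f, t_m/I)
# All constants below are hypothetical, for illustration only.

def balance_point(e_f, e_m, t_f, t_m, p_static):
    """Smallest intensity reaching 50% of peak energy efficiency,
    found by a crude geometric scan."""
    e_light = e_f + p_static * t_f          # best-case energy per flop
    I = 0.01
    while I < 1e4:
        e = e_f + e_m / I + p_static * max(t_f, t_m / I)
        if e_light / e >= 0.5:
            return I
        I *= 1.01
    return float("inf")

# e_* in joules/flop or joules/byte; t_* in s/flop or s/byte; p in watts.
with_static = balance_point(1e-9, 1e-8, 1e-10, 1e-9, 10.0)
no_static   = balance_point(1e-9, 1e-8, 1e-10, 1e-9, 0.0)
assert with_static < no_static   # static power shifts the balance point left
```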
So with a micro-benchmark that's easy to arrange; with something real it will be harder, so yes, it's possible some of this will break down. That's one of the limitations of the analysis so far, but we're just sketching an idea: let's write down some models, play with them, and see where it goes and what the implications are for future design. OK, other comments before I move on? [Audience question] Right. Let's see. OK, so this I already said. Then there's a similar experiment on a CPU. The fit is not as good, so there may be some more work to do there still, but the trend is in fact reversed, in an even stronger way: the time balance point sits farther to the right than the energy balance point. [Audience question] That's right. That's right, that's right. Yeah, I basically think you said the right thing: the cores on this right-hand-side platform are much beefier, and they do a lot more stuff; it's certainly not stuff that's exercised by the micro-benchmark, but you have to pay those overheads anyway, and that causes the shift. At this level we can't map to finer-grained features; I don't think there's enough in the benchmark to do that. But that is one of the things Jee has been talking about doing: separately accounting for the energy to hit the cache at various levels, and all that other kind of stuff; one could certainly imagine doing that. OK. So, the architects look bored; maybe they think this is hopeless. The only thing I'll say about the power model is that it's just dividing the curves, dividing the energy and time curves. The fit is not as good; on the GPU side in particular there is some power throttling happening.
So based on the fitted constants you could burn two hundred eighty watts in the worst case, but there's some hardware throttling that forces that not to happen — there's some cap, and that's the cause of this gap here. So that's something not accounted for in the model, and certainly power throttling will only become more and more important going forward. OK, so, all right. The main thing I want to leave you with from this first part — and since we're running low on time I'll do the second part in basically five minutes — is just this idea of writing down a first-principles model and really trying to map algorithmic and architectural features together. What I think we could all do together, if this program kind of works, is the next part. And this is Kent's thesis. He's been thinking in a somewhat different way about co-designing algorithms and architectures, and he has a model — you could think of it as something like what we wrote down in the previous part, but with tons of other stuff in it — and in particular he wants to reason about power and die-area constraints and what implications they have for algorithms. So let me talk very briefly about what he's been doing and show you what I think is one very interesting picture.
But first, just by way of motivation: the first part was sort of bashing the algorithms people, so now it's time to bash the architects. Back when I started this job, or soon thereafter, I went to a meeting in some beautiful setting in the mountains, and we were going on a hike. Here I was with somebody from NVIDIA and someone from Los Alamos — the guy from Los Alamos was the chief architect of the Roadrunner system, which in two thousand eight, the month before this meeting, had just become the number one machine on the Top 500 list; it was built out of Cell processors. So these were my GPU colleagues — basically my heterogeneous-computing colleagues — and we were hiking, and we were going down, and we didn't have a map. We reached some point, and it was just kind of funny: there was a sign pointing in the high-stakes direction — that's where the black diamonds were — and there was this other sign pointing toward the easier route, and you could see them looking longingly off in the high-stakes direction. So this, in my view, is how you guys like to think of yourselves. Here's what we actually did: we took the easier route. And here's the guy, still looking off to the right with his brain, but clearly his body headed in the other direction. So anyway — and of course they'd say, well, we were with a bunch of algorithms and software people, so we had to take the easier route. Fair enough. OK, but in some ways, as an algorithms person, this is how I view architecture sometimes: you guys do stuff and you throw it over the fence, and of course what I tried to say in the beginning is that maybe there's not enough communication
interpersonally. What we really need at this point is some sort of map — and maybe a compass — that would tell us how to get to the bottom. And what Kent's been doing is building this map. I'm going to show you the map — or at least the first one — and it's pretty interesting. So here's the notional design problem he's been thinking about. He says, OK, we're going to build — we'll start with one processor, and eventually we'll think about an entire supercomputer — and we impose some constraints on the system we're allowed to build. One is power: we can't use more than a certain amount of power. The other is that every processor that's part of the system only has a certain number of transistors on it. So this is a resource-allocation problem, or a constrained-optimization problem. What I can do with power is maybe give you more bandwidth, or maybe give you higher core clock frequencies; and what I can do with transistors is give you more cache, or give you more cores. These are fixed budgets that I have to allocate in some way between one or the other. Notionally, this means there is a space of all possible machines: some will have many cores that are slow, and some — toward the lower-right corner — will have a few very, very fast cores and lots of cache. And given an algorithm, what I can imagine doing is running it on all of these machines and finding the best one — maybe the best one is inside the bull's-eye — and you can think of these contours as contours of iso-performance. All right, so imagine that we take the models we saw in the previous part, and the algorithm analyses of the sort I led off with, and we enrich them to also reason in some way about area; we put this all together, so we have a model of time and we have models of power and area, and we solve this constrained optimization problem.
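A brute-force version of this constrained optimization is easy to sketch. Everything below — the bandwidth bought per watt, the transistors per core, and the blocked-matmul and FFT traffic formulas used as stand-ins for algorithm models — is an assumption I'm making for illustration, not Kent's actual model.

```python
import math

# Toy version of the design-space "map": split a fixed power budget between
# clock rate and memory bandwidth, and a fixed transistor budget between
# cores and cache, then pick the fastest design for a given algorithm.

POWER = 100.0   # watt budget (assumed)
AREA  = 1e9     # transistor budget (assumed)

def build(f_bw, f_cache):
    """Spend fraction f_bw of power on bandwidth (rest on clock rate) and
    fraction f_cache of transistors on cache (rest on cores)."""
    bw    = 2e8 * f_bw * POWER                       # bytes/s (assumed rate)
    cores = max(1, int((1 - f_cache) * AREA / 2e6))  # 2e6 transistors/core (assumed)
    freq  = 1e7 * (1 - f_bw) * POWER                 # flop/s per core (assumed)
    cache = f_cache * AREA / 8                       # cache bytes (toy density)
    return bw, cores * freq, cache

def matmul_cost(n, cache_bytes):
    # blocked matrix multiply: traffic shrinks with sqrt of cache size
    words = max(cache_bytes / 8, 1.0)
    return 2.0 * n**3, 8.0 * n**3 / math.sqrt(words)

def fft_cost(n, cache_bytes):
    # large FFT: traffic shrinks only with the log of cache size
    words = max(cache_bytes / 8, 2.0)
    return 5.0 * n * math.log2(n), 8.0 * n * math.log2(n) / math.log2(words)

def best_design(cost, n):
    """Brute-force a 9x9 grid of budget splits; return (time, f_bw, f_cache)."""
    best = None
    for f_bw in [i / 10 for i in range(1, 10)]:
        for f_cache in [j / 10 for j in range(1, 10)]:
            bw, flop_rate, cache = build(f_bw, f_cache)
            flops, traffic = cost(n, cache)
            t = max(flops / flop_rate, traffic / bw)  # overlapped execution
            if best is None or t < best[0]:
                best = (t, f_bw, f_cache)
    return best

mm = best_design(matmul_cost, 4096)   # compute-intensive workload
ff = best_design(fft_cost, 1 << 22)   # communication-intensive workload
print(mm[1], ff[1])
```

Even with these made-up constants, the optimizer lands in different corners of the map: the matrix-multiply machine spends almost nothing on bandwidth, while the FFT machine pours most of its power budget into it — which is the gap between the two dots labeled 1 in the two plots.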
OK, so imagine formalizing this. The research is all about the models: what do the models look like, what terms do they have to account for? And here are two of the maps he's built so far — we're just starting this. One map is for matrix multiply, and the other is for a 3-D FFT, and the main algorithmic difference is that matrix multiply is very compute-intensive and does relatively little communication, while the 3-D FFT is just the opposite. The optimal point is the dot labeled 1 in the two plots — that's the fastest machine in this space for that algorithm. So here's a funny thing. If you take today's CPU and GPU processors, and you extrapolate them according to the trends that have held for the last forty years — how cache size grows, how performance and bandwidth have changed over time — and you extrapolate to twenty eighteen, taking into account transistor shrink and all the other stuff, these are the two points you get. So we're basically building matrix-multiply machines — that's what this says. And in the meantime there's all this other space out here. What is that stuff? Are any of those worth building? I don't know. Are we even trying? Sorry — I heard a yes, so one of you is building one of these out in that space? Right, right — but which machine are you building? OK, that one — the one labeled 1. So this is what I call evolution, right? And now that we have the map, I can come to you and say: this is the map, this is where you're going — is this really where we want to be? And the answer is, I don't know; it depends on the workload. Is my workload more like a matrix multiply or an FFT? Who knows.

OK, so there is one notional design that's been put forward by the people who are building exascale — these exaflop machines. NVIDIA has proposed — they've written a paper, the same paper I referenced earlier, sketching out a notional design for a processor they call Echelon — and we estimate, based on that paper, that Echelon sits there in this design space. And I think that's encouraging: it's sort of leaping out of this region where we've been stuck for a very long time. Is it really revolution? Again, I don't know; it depends on the workload, it depends on a lot of things. What I can say about it is that it's better than the notional projected CPU system by a lot on both problems — but, interestingly enough, it gives up performance on matrix multiply. So in order to move somewhere else in this space there's going to be a trade-off; we're going to give something up — in this case, matrix multiply. This conversation is a little more relevant in the HPC world, where everybody complains about tuning for matrix multiply, and this sort of analysis makes it more precise. So the research question, I think, is: what about all this other stuff? Let me make one quick observation — what I think is the most interesting feature of these two particular maps — which is what they have in common. Suppose I built this machine. Notice where it sits on the x-axis. Now I'm going to draw the same line on the x-axis of the other plot, and you'll see it actually cuts through the region of interest for FFTs. So without changing my x-axis point, if somehow I could magically reconfigure the y-axis, then I could actually do both of these in near-optimal time, given the constraints on power and area. But what it requires is some kind of extreme power reconfigurability. We talk about speed stepping and tweaking clock frequencies — that changes power by a little bit, right? This is saying: can you change it by 8x?
So is this feasible? Again, I don't know, but I think this is another interesting problem for the architects to think about: of all of these designs, which ones are feasible, and could you build a system with extreme power reconfigurability? And there are people who are thinking about this. Right, right — so, yes, I think 3-D stacking is basically what enables getting into this range. The interesting thing here is: can you build a processor where you reconfigure the power and the speed dynamically? Right — I'm running matrix multiply, then I stop and switch to an FFT, and I want to change the bandwidth. I don't just want to have the maximum; I actually want to change it: take all the power that was going to processing, shut it down, bring it to memory bandwidth. I think people are interested in this reconfigurability idea; this says you need to do it in a big way — there's almost an order-of-magnitude change here. So is that possible? I don't know. OK. So then, since we're basically at the end, let me just end on one picture, which is what happens when you take this and think about building an entire supercomputer. Again you have power and area constraints, and you're allocating these resources in some way, and you can build some systems. The thing is — and I'm going to take this back to the algorithms people — one kind of interesting analysis Kent has done is to ask: how do algorithms power-scale? So I have some algorithm, and let's say I give you a machine and you tune for it. Each line represents a different algorithm, and this is how fast it runs relative to a baseline that uses some amount of power. So every point on this curve — this green point, for example — says: I build a fifty-megawatt machine, I tune it for the FFT
by running this model — doing the map thing, finding the minimum — and then I build it. How much faster does it go compared to the twenty-megawatt machine? And this sort of separates these algorithms into three classes based on how well they power-scale in this idealized setting. So, for those of you thinking about algorithms: which algorithm should I use? Well, if power is the constraint, you want to use the one that power-scales best — maybe. So maybe that will shift you in one direction versus the other. For my CSE colleagues — people ask, should we do less implicit stuff and go back to doing explicit stuff? This says maybe you'd get a big payoff there. But remember Jee's analysis: the flop-communication tradeoff says there's a limit, based on balance. So anyway, there's some kind of story here that I think is emerging, and I think it has directions for both the algorithms and the architecture people. OK, so with that I will stop, and I'll take questions if you have them. Thank you. [Question from the audience about communication.] Yeah, yeah, there are a bunch. The simplest class — you've heard of stencil computations? That's one good example, but there are many others. For people interested in solvers — I see there are some solvers people here — preconditioners are one place, when we solve big linear systems, where you can play this game: you can have a preconditioner that's very parallel and doesn't do a lot of communication, but those tend to also be crappier in some ways — you get a slower rate of convergence — so there's a tradeoff there. Not quite — so there are lots of examples.
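Coming back to the power-scaling picture for a second, the three classes can be sketched with a toy model in which each algorithm's runtime has a part that shrinks as the power budget grows and a power-insensitive (serial or latency) part. The specific profiles and per-watt rates below are invented for illustration; they are not Kent's numbers.

```python
def best_time(flops, bytes_moved, serial_s, power_w,
              flops_per_watt=1e9, bytes_per_watt=1e8):
    # Splitting power P between compute and bandwidth, the overlapped time
    # max(W/(a*(1-f)*P), Q/(b*f*P)) is minimized when both terms are equal,
    # which gives the closed form below, plus a power-insensitive term.
    return serial_s + (flops / flops_per_watt + bytes_moved / bytes_per_watt) / power_w

ALGOS = {  # (flops, bytes moved, power-insensitive seconds) -- all invented
    "scales-well":   (1e18, 1e17, 0.0),
    "scales-so-so":  (1e18, 1e17, 10.0),
    "scales-poorly": (1e18, 1e17, 100.0),
}

BASELINE_W = 20e6  # tune and build at 20 MW, then again at 100 MW
for name, (W, Q, s) in ALGOS.items():
    speedup = best_time(W, Q, s, BASELINE_W) / best_time(W, Q, s, 100e6)
    print(f"{name}: {speedup:.2f}x going from 20 MW to 100 MW")
```

In this toy, the first profile speeds up nearly 5x for 5x the power while the latency-dominated one barely moves — so if power is the binding constraint, that ordering, not raw flop counts, is what would steer the choice of algorithm.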
There is some — so, I guess, depending on which direction you're going: there was a cliff in caches, and basically there's probably some working-set phenomenon — once you have the working set, there's no reason to have more cache — and there's also kind of a concurrency effect, because cache and concurrency are trading off, so you might expect some critical point you don't want to go too far beyond. And in the power direction, you need to balance communication with the rate of processing, so again you'd expect there's some kind of balance point that defines the cliff in the other direction. In my plots that was the y-axis. Yeah. Yeah — that's the other one, time and energy. Yeah, maybe other kinds of time. Yeah, I would just do auto-tuning, and that will fix your problems. Well, no, I don't think about that — but we're academics, so I can talk about everything. We're about out of time — other comments? Any reactions from the architecture people, like, this is junk, everybody knows there's a time-energy tradeoff, nothing new here? OK — is that something I can say in public? OK. Well, thank you very much.