[Host:] ...a long list of contributions to supercomputing research; he is a recipient of the George Michael HPC Fellowship. So it is really a pleasure and an honor — please join me in welcoming him.

[Speaker:] Thank you. Is this audible? OK. So I am a doctoral candidate at the University of Illinois. I work with Professor Laxmikant Kale, who typically goes by Sanjay, and my doctoral thesis is focused on automating topology-aware mapping for large parallel machines. Before I jump into my dissertation, I will give a brief motivation for why we work with HPC and what kinds of applications I work with, and then I will try to frame my specific work in the area.

My interests are in performance analysis and tuning of parallel applications. Within performance analysis I focus on communication optimizations, which tend to boil down to mapping of applications, and on load balancing of applications — NAMD and OpenAtom being two of them. They are both written in Charm++, which is a C++-based parallel programming model developed at Illinois in my advisor's lab, and the constant goal we are trying to achieve is better performance, to help scientists get their simulations done as fast as we can.

Just a few examples of applications that use large allocations on supercomputers. WRF is the Weather Research and Forecasting model; it is a collaboration between NCAR and a few other places. In this figure you are seeing a forecast for March twenty-third of the precipitation over the continental US, and this one shows the pressure values all over the US. I will just describe the applications first and then try to motivate what I am talking about. NAMD is another application, which does molecular dynamics; here we are seeing one of the largest simulations ever done — the binding of ions to a channel, a 2.7-million-atom simulation. This is a collaboration between my advisor, who heads the Parallel Programming Laboratory, and Professor Klaus Schulten of the Theoretical and Computational Biophysics group at Illinois. Another application is FLASH, which is used for astrophysical and cosmological simulations; this is one of the first fully three-dimensional simulations of a type Ia supernova explosion, and it is a collaboration between Argonne, the University of Chicago, and a few other places.

By giving examples of these applications, what I am trying to say is that if you tried to run these simulations on a single machine it would take years, maybe more than the lifetime of a human. These simulations typically take a very long time, and we are trying to use supercomputing power to get them done in time ranges which are meaningful — for example, for weather forecasting you want to finish within twenty-four hours so you can forecast the next day's weather. Hence there is a lot of work going on in the HPC area. This is a graph from the Top500 list, which is published every six months and lists the fastest five hundred supercomputers in the world. The bottom line is the number-five-hundred machine and the top line is the number-one machine, in terms of the computing power available on these machines. We crossed the petascale barrier last year and we are moving towards exascale, so there is lots of interesting work going on and lots of challenges we face as we scale to large machines. Let us look at the sizes of these machines.
We have five supercomputers which have more than a hundred thousand processors each, and a very large number of processors in total in this range. The work I am going to present today is most relevant for machines at this very large scale. It can be made to work for smaller machines, but it remains to be seen whether it gives similar performance improvements on smaller machines as the ones I will present today for these large supercomputers.

So this was just a motivation for HPC. I will now get into the motivation and introduction for the work I have been doing. After that I will move on to discussing contention on supercomputers and topology-aware mapping, and then I will discuss the automatic mapping framework we are trying to develop, which relieves the application developer from doing the mapping themselves — everything is done in the runtime automatically, so the application developer can get on with the science and users can get improvements from this mapping for free.

OpenAtom is a highly parallel application; it runs on a large number of cores. This graph is from back in two thousand six, when we were trying to run it on the Blue Gene/L machine at IBM T. J. Watson. On the X axis is the number of cores we were running on; on the Y axis is the time per iteration, or time per step, in seconds. The simulation is of thirty-two water molecules — a very small system, not expected to scale very well — but still, what we are seeing here is that from two thousand processors onward the scaling is not very good. Ideally you would expect the time to keep going down, as it does from five hundred twelve to one thousand twenty-four processors. So we did some performance analysis of why this is happening. This is a view from the Projections performance analysis tool, which is a part of Charm++; Charm++ has automatic instrumentation of parallel applications, and you can visualize the logs and see the various things happening within the application. This particular visualization is called the timeline view: on the X axis is time, and on the Y axis we have a few of the processors — this was a one-thousand-processor run, and a few randomly chosen processors are shown here. Each of the different colors is one particular function being executed on a given processor. What we see in this red box is that there is a lot of white, which corresponds to idle time — the processor is doing nothing, just waiting for something to happen — and the analysis showed that most of the time is being spent waiting on messages; at this point all messages are received and all processors start executing again. Now, what we noticed — and you can see this is around nine seconds — was that messages of the sizes being sent in this application should not be taking such a long time, and we attributed this to contention on the network. Once we did topology-aware mapping to avoid contention on the network, we were able to get much better performance, and I will show the improved plots after we discuss what we did to solve the problem.

So consider a one-dimensional mesh — processors placed on one line. Most supercomputers today use wormhole routing for sending messages on the network. What that means is that a given message is broken down into smaller flits: you send a header flit out, and the header flit decides the route of the message; all of the remaining flits in the message just follow the header flit. Based on that model, you can model the message latency as the sum of two terms. The first term is the latency for the header flit to reach the destination — you have a dependence on the distance there, because at each router you pass through you incur some latency. The second term is the bandwidth term, where L, the total size of the message, is divided by the bandwidth available on the link. That gives the total time for the message to reach the destination if you assume a wormhole routing model.
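Writing that model down as an equation (a standard wormhole-routing approximation; the symbols here are mine):

\[ T_{\text{msg}} \;\approx\; D \cdot t_{\text{hop}} \;+\; \frac{L}{B}, \]

where \(D\) is the number of links (hops) the message traverses, \(t_{\text{hop}}\) is the per-hop latency seen by the header flit, \(L\) is the message size in bytes, and \(B\) is the link bandwidth. The contention scenario discussed below then amounts to saying that if \(k\) messages share the most loaded link on the path, each one effectively sees bandwidth \(B/k\), so the second term grows to roughly \(kL/B\) and can no longer be treated as independent of placement.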
Now, typically people have assumed that since the first term — the length of the header flit times the distance the message is traveling — is small, you can neglect it, and the time for a message is mostly the second term. But this only works well if you have a communication pattern where messages are not sharing network links: processors communicating in pairs, and everything works fine. In a scenario where different messages are sharing links between them, you actually have contention for some of the middle links — specifically, for this particular link here you have three messages trying to use it, so the bandwidth available to each message on this link is reduced to one third and every message is delayed. When you are in such a situation you cannot use that equation to simply model message latencies. And again, this is the problem we want to avoid. What we want to do is place this object and this object on processors close to each other, so that messages do not travel far in the network; that leads to less contention, which leads to better messaging times and to better overall performance.

I described this in terms of a simple one-dimensional mesh, but almost all of the largest supercomputers today have some interconnect topology. Some of them are three-dimensional meshes or tori: the Cray machines — XT3, XT4 and XT5 — have a 3D torus if you consider the whole machine; if you get a smaller partition it is typically a 3D mesh. The IBM machines, BG/L and BG/P, are 3D tori. You also have some of the biggest machines as InfiniBand clusters — Ranger, for example, is an InfiniBand machine, and Roadrunner at Los Alamos, which is at the top of the diagram there, is again an InfiniBand machine. IBM also has its proprietary Federation interconnect. And in the future you might have even more radical topologies, such as the new BG/Q machine or the Blue Waters machine. What we want to do is to be able to exploit these topologies for better performance.

So that is the topology of the machine. Most of the time, applications also have a communication topology, or a communication graph. For example WRF, which was the first example I showed, has a simple two-dimensional communication pattern: it is a stencil computation, so every process talks to four neighbors, one in each of the two directions. NAMD again has a specific communication pattern, and I won't go into the details. FLASH, which is an unstructured computation, has an irregular pattern.
So again, each application has a certain communication graph which we can exploit and map onto the processor topology. That brings us to the mapping problem. To avoid contention, what we are trying to do is this: given the interconnect topology of the supercomputer and the communication topology of the application, we want to map the tasks — the parallel entities in the application — to the physical processors to optimize communication. Our first order of business is to balance the computational load, because if you do not have load balance you will not get good running times; you do not want a certain processor to spend too much time doing work while others are waiting. So the first-order concern is load balance, and the second-order concern is to minimize communication traffic on the network by co-locating communicating objects on nearby processors. We hope that by minimizing contention we can get better performance.

Before I go into the actual techniques, some related work and how my work differs from previous work. In the eighties there was a lot of work on mapping object graphs onto processor graphs. Most of these were theoretical studies; the machines were small — for example one hundred twenty-eight processors — so the number of hops, or links, each message traveled was not large. These techniques were slow and offline, because they did not care about time complexity: with one hundred twenty-eight processors, even if the algorithm was order n cubed, it did not take a long time. In the nineties, virtual cut-through and wormhole routing were introduced, which brought down the messaging time because the initial dependence on hops was reduced, and people thought that mapping was no longer important. With the emergence of machines like BG/L and BG/P, IBM started saying that topology is becoming important and application developers should pay attention to it, and so some recent work has been done by people at IBM and by independent application developers.

What I am trying to do here is, first, to re-establish that mapping is important — we have done various contention benchmarks to prove this — and to show that it also matters for machines with much faster interconnects with very high bandwidth. Then we try to prove that the work is important by using real scientific applications, running them on supercomputers with the algorithms we develop, and seeing whether there are actual performance improvements. Finally, we are trying to develop scalable and fast runtime solutions, because machines are becoming larger and larger and we want to do things at runtime, as the simulation proceeds, so we do not want to take a lot of time doing the load balancing. And we are trying to make everything application-independent: we are developing an automatic mapping framework which can take the communication graph of an application and build the mapping, so the application can just use the mapping solutions to get better performance.
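Stated a bit more formally (my notation, as a sketch of the optimization problem just described): among all assignments \(M\) of tasks to processors that keep the computational load balanced, we look for one that minimizes the total traffic,

\[ \min_{M} \; \sum_{(u,v)} b(u,v) \cdot d\big(M(u), M(v)\big), \]

where \(b(u,v)\) is the number of bytes exchanged between tasks \(u\) and \(v\), and \(d\) is the number of network links between the processors they are placed on. This is the same quantity that reappears later in the talk as the hop-bytes metric used to compare mapping heuristics.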
A small comment about the scope of this work. We are currently focused on three-dimensional torus machines; we might also look at InfiniBand networks and other networks, but we have not done that so far. Still, these machines form a significant percentage of the Top500 list in terms of the compute power they provide — maybe not in terms of the number of entries in that list. As for applications: there are various applications in parallel computing, and we can broadly divide them into computation-bound and communication-bound; communication-bound applications are the ones which will be most influenced by these techniques. Even within communication-bound applications you can divide them into two categories. Latency-tolerant applications are the ones where, if a processor is waiting for a message, it can do some other work. NAMD is an example of a latency-tolerant application: it does virtualization, so if a given object on a processor is waiting for a message, another object on the same processor continues to do some work. Those applications might not get that much improvement. Latency-sensitive applications, which actually wait for messages before they can proceed further, are the ones which stand to benefit the most from topology-aware mapping. Any questions so far?

OK, so I will move on to the first part of my talk, which is on contention and topology-aware mapping. We are trying to prove the claim that contention does affect message latencies, so we do two simple benchmarks: one of them creates a no-contention scenario and the other one creates contention on the network. The first one is a one-dimensional simplification of the 3D torus: one particular rank in the allocated job partition — the master processor — sends messages to all of the other processors in the job partition, but serially, one by one. It just sends a message to this processor, then after that to the next processor, and so on. So you do not create any contention in the network, and you record the message latencies for these message sends. What we expect is that, since there is no contention on the network, there should not be a significant dependence on hops — the number of links messages travel. Let us see the results, and I can talk more about this. This is how the plot looks: on the X axis is the message size — we do this for different message sizes — and on the Y axis is the time it takes for a ping-pong between the chosen master processor and the other processors. There are multiple circles here for each size, which are the times it takes to send messages to different processors on the network. You see a spread here for the smaller messages, and it reduces to essentially one circle for larger messages. This is happening because we have a very small per-hop routing latency. The red line shows the difference between the minimum and the maximum: the difference is about thirty percent on the lower end and goes down to around five percent there. The difference for the smaller messages is coming from the first term — there is a dependence on hops for small messages because the first term is not negligible — whereas for very long messages the first term is negligible and you just see the dependence on the size of the message.
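A minimal sketch of that no-contention benchmark (my own reconstruction, not the actual benchmark code; rank 0 stands in for the master rank, and the hop distance to each partner — which the real study records from the torus coordinates — is left out here):

```cpp
// ping_nocontention.cpp -- rank 0 ping-pongs with every other rank in turn,
// one partner at a time, so no two messages are ever in flight together.
#include <mpi.h>
#include <cstdio>
#include <vector>

int main(int argc, char** argv) {
  MPI_Init(&argc, &argv);
  int rank, nprocs;
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);
  MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

  const int num_iters = 100;
  for (int msg_size = 4; msg_size <= (1 << 20); msg_size *= 2) {
    std::vector<char> buf(msg_size);
    for (int partner = 1; partner < nprocs; ++partner) {
      MPI_Barrier(MPI_COMM_WORLD);          // keep all other ranks quiet
      if (rank == 0) {
        double start = MPI_Wtime();
        for (int i = 0; i < num_iters; ++i) {
          MPI_Send(buf.data(), msg_size, MPI_CHAR, partner, 0, MPI_COMM_WORLD);
          MPI_Recv(buf.data(), msg_size, MPI_CHAR, partner, 0, MPI_COMM_WORLD,
                   MPI_STATUS_IGNORE);
        }
        double one_way_us = 1e6 * (MPI_Wtime() - start) / (2.0 * num_iters);
        std::printf("size %d  partner %d  latency %.2f us\n",
                    msg_size, partner, one_way_us);
      } else if (rank == partner) {
        for (int i = 0; i < num_iters; ++i) {
          MPI_Recv(buf.data(), msg_size, MPI_CHAR, 0, 0, MPI_COMM_WORLD,
                   MPI_STATUS_IGNORE);
          MPI_Send(buf.data(), msg_size, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
        }
      }
    }
  }
  MPI_Finalize();
  return 0;
}
```

Plotting the measured latency against the hop distance of each partner gives the spread described above for small messages and the collapse to a single value for large ones.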
Now, compared to these Blue Gene/L results, when we go to the Cray machine — and again, this is a plot on the Cray machine, which has a much higher bandwidth — we see similar results: for small messages there is a small dependence on the number of links traveled, but for larger messages there is no dependence. So these results look similar for the two machines.

Then we do a contention benchmark. In this benchmark, processors are paired so that in the first case every processor talks to another processor which is one hop away; in the next run, every processor talks to a processor which is two hops away; and so on. We do this from one hop up to N hops, the maximum number of links, or the diameter of the network. What you can see is that for three hops, for example, a given link is being shared — this link is shared by this message, this message and this message — so we expect the bandwidth to be shared and hence the message latencies to increase. This is how the plot looks, again with message sizes on the X axis. What you see is that for smaller messages there is no dependence on the number of hops the message is traveling. When we go to large messages, the red line is the one-hop case, which is the baseline where there is no contention — everyone is talking to someone one hop away — and as you increase the number of links each message traverses, you see that the time increases significantly. This is on a log scale, so the difference between the bottom-most line and the top-most one, at eight hops, is actually sixteen times. What this is saying is that for small messages there is no contention, because the number of packets is small, but for large messages, when you start sending messages far away on the network, as you increase the number of hops you see a significant dependence on both the distance and the contention. This is the problem we want to avoid: we do not want messages to be sent far away on the network in a real parallel application.

These are the results on the XT3 machine at Pittsburgh. It is a much faster machine in terms of the interconnect: the link bandwidth on the Blue Gene/L machine is around one hundred seventy-five megabytes per second, while on this one it is around 3.8 gigabytes per second. So there is a smaller dependence, but it is still about two times between the one-hop case and the farthest case. So we expect that the techniques we develop will still be useful on machines like the Cray. Also, machines typically have a much lower effective bandwidth than the peak advertised bandwidth, and for Cray machines it is much less — around two gigabytes per second compared to the 3.8 that is advertised — whereas Blue Gene/L stays within ten percent of its peak advertised bandwidth. So on Cray machines the bandwidth you actually get is much lower than what is advertised.

So now I will move on to a case study of OpenAtom: how we did topology-aware mapping there and the performance improvements we obtained. We already saw this plot — we have a kink there in the graph — and we were trying to see why the performance problems are there. We found out that contention on the network was a problem. We also knew this because this application was developed in-house, and we knew that it is a very communication-intensive application: it does a lot of FFT transposes, which lead to all-to-all, many-to-many communication.
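Stepping back to the pairwise contention benchmark described a moment ago, here is a rough sketch of the idea (my reconstruction, not the actual benchmark; it pairs rank r with rank r + distance in rank order, which corresponds to a fixed hop distance only if ranks are laid out linearly along the torus, so treat that as an assumption):

```cpp
// contention_pairs.cpp -- all ranks exchange large messages with a partner
// 'dist' ranks away at the same time, so messages share network links.
#include <mpi.h>
#include <cstdio>
#include <vector>

int main(int argc, char** argv) {
  MPI_Init(&argc, &argv);
  int rank, nprocs;
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);
  MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

  const int num_iters = 50;
  const int msg_size = 1 << 20;       // 1 MB: the regime where contention shows
  std::vector<char> sbuf(msg_size), rbuf(msg_size);

  for (int dist = 1; dist < nprocs; ++dist) {
    // Pair the lower half of each block of 2*dist ranks with the upper half.
    bool lower = ((rank / dist) % 2) == 0;
    int partner = lower ? rank + dist : rank - dist;
    MPI_Barrier(MPI_COMM_WORLD);
    double start = MPI_Wtime();
    if (partner >= 0 && partner < nprocs) {
      for (int i = 0; i < num_iters; ++i)
        MPI_Sendrecv(sbuf.data(), msg_size, MPI_CHAR, partner, 0,
                     rbuf.data(), msg_size, MPI_CHAR, partner, 0,
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    }
    double per_iter_ms = 1e3 * (MPI_Wtime() - start) / num_iters;
    if (rank == 0)
      std::printf("distance %d: %.3f ms per exchange\n", dist, per_iter_ms);
  }
  MPI_Finalize();
  return 0;
}
```

As the distance grows, more and more exchanges overlap on the same links, which is the effect behind the sixteen-fold slowdown seen on the Blue Gene/L plot.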
Just a small introduction to OpenAtom: it is an ab initio molecular dynamics code. Communication in this application is static — it does not change with time. If you map objects onto certain processors, you will get the same communication pattern as the iterations evolve. It is regular in the sense that you can draw nice graphs in 2D or 3D of the communication patterns. The only challenge here is that we have multiple groups of objects: because this is written in Charm++, you can have multiple arrays of objects which are mapped onto the machine, unlike MPI where you just have one process per physical processor. So there are multiple groups of objects, they have conflicting communication patterns, and that is the challenge. As for the way this application is parallelized — I will not go into the details, but these three arrays are the most important in terms of the communication requirements. G-space and real-space communicate with each other plane-wise, so this particular row of G-space communicates with this particular row of real-space. G-space and the PairCalculator communicate state-wise in one direction and plane-wise in the other: this particular column of G-space communicates with the first plane here of the PairCalculator, and so on. So you see there is a conflicting pattern, in the sense that we want to map this group and this group along one dimension, and this group and this group along the other dimension, and we want to co-locate all of these together so that the communication is minimized. This work is joint work with Dr. Glenn Martyna at IBM, Dr. Mark Tuckerman at NYU, and Eric Bohm from my lab.

What we did was a topology-aware mapping of the chares, which are the objects in Charm++. The G-space objects are first mapped onto the 3D torus partition. Once we have done that, we try to map the communicating states of real-space close to the communicating states of G-space — if G-space is mapped here, we try to place real-space here. What that does is restrict the G-space and real-space objects to the same area, the same region, of the torus partition. Then, depending on how G-space is mapped — for example this green plane — we try to map this PairCalculator plane close to those objects. These are the three important communication patterns; there are several other communication patterns, and the other objects in OpenAtom are also mapped according to those. Using this mapping we were able to bring down the overall time per step of the application significantly. The old one took around 8.5 seconds; the new one takes 5.2 seconds, and you can see that the time spent in this phase is reduced significantly. We were also able to reduce the time for this phase, which might just be because of better load balance in this case. It is also important to remember that the default mapping was not unintelligent — it was a very well load-balanced mapping; it was just not communication-aware, not topology-aware, so it was not trying to place communicating objects physically close. Just doing the topology-aware mapping brings us this much better performance, a large reduction in the time per step.

[Question] Yes — this is just a graph cut at a point in time where we have certain boundaries, so this might still be spill-over from the last phase, the last part of the time step appearing here. I do not have a good answer for that; there might be several possible reasons. It might be that, since the mapping has been done differently, there are other things which can lead to this.
But overall we are able to reduce the phases which used to take the most time, and hence we get a much better improvement. [Question] You are asking what is the optimal time the application could achieve — I do not know if I have a good estimate of that. The best would be if we could shrink all of this here — although you would still have all the white space. It might be that this block here is a reduction; it keeps reducing across processors as you go. And you still have the black here spread all over. So if we could go further we might get a much better improvement. The best possible would be if we could get rid of this white space here, and here.

What this does is save time: a simulation which would have taken six months now takes only three months — we halved the time on eight thousand processors. So it saves allocation time on the supercomputer, and it also gets scientists their results faster, so it is better both ways. This is how the scaling performance looks for this application — that was on eight thousand processors. The red line is the new plot with the topology-aware mapping. Since this is time per iteration, lower is better, so we are able to get better timings: we got rid of this artifact here and the time is much better, and as we scale up we keep scaling much better. This was for a small system of thirty-two water molecules; let us see what it looks like for a larger system, which is two hundred fifty-six water molecules, again on the same Blue Gene/L machine. We still get good performance — not quite as good, but around here we are actually taking half the time we were taking before the mapping. Both of these systems are benchmarks used by computer scientists to benchmark this application. We also used a real system which was being used by scientists, and we saw similar performance improvements there as well — you can still see good improvements between the green line and the purple line.

We also wanted to make sure that this works well on other machines, so these are results on the Cray XT3 machine. Now, the Cray XT3 has a 3D torus, but the job scheduler there is not topology-aware, so if you request nodes for a job you might get nodes all over the network. These runs were done through a system reservation where we could allocate 3D contiguous partitions, and then we did our runs to see if we get improvements. As you can see, this is the default line, and we are getting similar improvements, just like before. It is important to remember that the Cray has twenty-one times more bandwidth than BG/L, at least in peak advertised values, and still communication is a problem and we are still able to get good improvements. This is for the other system, and we see good improvements here too.

OK, so this was a case study of OpenAtom. We also did this work for NAMD, but since I have limited time I will not go into that. Once we had done the mapping for these two applications, we realized — we were the application developers here — that application developers really should not have to do this inside the application; the runtime should be able to do the mapping automatically, so that the application can get this for free. So the major part of my thesis is an automatic mapping framework,
which we are developing for both Charm++ and MPI applications, and in the future for other parallel programming models. What we want to do is obtain the communication graph, do some pattern matching to find out whether the communication graph is regular or irregular, and, depending on that, choose among the mapping algorithms we have developed. I will not get into the irregular graphs in this talk, but I will talk about how we obtain the processor topology graph and the communication graph, and I will give some examples of how we do the mapping for certain applications. So the two inputs we need for the framework are the processor topology graph and the communication graph of the application. We then do pattern matching to identify whether there are 2D, 3D or 4D near-neighbor communication patterns. If there are such regular patterns, we use specialized heuristics; if it is an irregular pattern, we use a different set of more general heuristics.

The first thing is to obtain the processor topology graph. The application, or the runtime, needs information such as the dimensions of the allocated partition and the mapping of ranks to physical coordinates — which MPI process is on which physical processor, and how it could be moved around. So we have developed a Topology Manager, which is a uniform API we make available on the IBM and Cray machines. On BG/L and BG/P we provide a wrapper around system calls which are already available. Cray advertises that you do not need to do topology-aware mapping — they have a fast interconnect — so there are no system calls to obtain the topology information while a job is running; there we had to dig this information out of lower-level system sources and build it up ourselves. But this interface makes things independent of the machine: whether you are running on an IBM Blue Gene machine or a Cray XT machine, as long as you are on a 3D torus machine you can get this information about the dimensions of your job partition and then do the mapping of your application.

[Question] So, most MPI applications do not use those calls. I guess what you are asking is: if you use the MPI Cart options, does the implementation have the underlying topology information, and does it do a good mapping? The only MPI implementation I know of that does is the IBM Blue Gene/P one, which actually tries to do something when you specify your topology for the application; but on, say, an InfiniBand machine these calls do not translate into a good mapping underneath — they have no information whatsoever about the underlying machine topology. That is one of the things we are trying to do: we are trying to push these ideas into MPI, and we are trying to see if we can make the MPI Cart implementation topology-aware in other MPI implementations as well.
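For reference, this is what the MPI Cartesian topology interface looks like from the application side (a minimal sketch; as just discussed, whether the reorder flag actually produces a topology-aware placement depends entirely on the MPI implementation):

```cpp
// cart_example.cpp -- declare a 2D periodic process grid to MPI and let the
// implementation (optionally) reorder ranks to match the physical topology.
#include <mpi.h>
#include <cstdio>

int main(int argc, char** argv) {
  MPI_Init(&argc, &argv);
  int nprocs, rank;
  MPI_Comm_size(MPI_COMM_WORLD, &nprocs);
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);

  int dims[2] = {0, 0};
  MPI_Dims_create(nprocs, 2, dims);   // pick a near-square 2D decomposition
  int periods[2] = {1, 1};            // wrap-around in both dimensions
  int reorder = 1;                    // permission to reorder ranks (a hint)

  MPI_Comm cart;
  MPI_Cart_create(MPI_COMM_WORLD, 2, dims, periods, reorder, &cart);

  int newrank, coords[2];
  MPI_Comm_rank(cart, &newrank);
  MPI_Cart_coords(cart, newrank, 2, coords);

  int left, right, down, up;          // neighbors for a 4-point stencil
  MPI_Cart_shift(cart, 0, 1, &left, &right);
  MPI_Cart_shift(cart, 1, 1, &down, &up);

  std::printf("world rank %d -> cart rank %d at (%d,%d)\n",
              rank, newrank, coords[0], coords[1]);

  MPI_Comm_free(&cart);
  MPI_Finalize();
  return 0;
}
```

On most clusters the reorder hint is simply ignored, which is exactly the gap being pointed out here.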
The next thing we need is the object communication graph. There are two ways to obtain it: if you are the application developer, you could obtain it manually, since you know the application's communication patterns; but we want to do it automatically. For MPI applications we use profiling — we use IBM's HPC Toolkit to get the communication matrix of the application: which MPI process talks to which other MPI process, and how many bytes are communicated. For Charm++ applications, the runtime has an instrumentation framework which can give you all this information at runtime.

Once we have the graph, we want to visualize it and do pattern matching. For example, looking at the Weather Research and Forecasting model: this is the communication matrix of the application on thirty-two processors of BG/P. You have MPI processes zero to thirty-one on one axis, and for each process you see which other MPI processes it talks to. This is a very regular application, so the number of bytes communicated is the same everywhere and the colors of the squares are similar; if different amounts of bytes were being communicated, you would see different colors. Now, given this information, if you are trying to do automatic mapping you want to know whether this is a regular pattern, so we use pattern matching to find out. It turns out that in this case it is a simple 2D communication: an eight-by-four communication graph where each MPI process communicates with four neighbors, one in each direction. For some applications there is wrap-around; for some there is not. Once we have the communication pattern, we want to compute a mapping and provide the mapping solution to the application, so that the next time it runs it can pass the mapping file to the job scheduler and change the ordering of ranks on the physical processors.

Next I will discuss some algorithms we developed for structured 2D communication patterns. We have developed a suite of heuristic techniques to do this mapping; I will describe two of them, but I will show results from several. One possible technique is this: on the left is the object graph, on the right is the processor graph, and you are trying to map the object graph onto the processor graph. Again, the communication pattern assumed here is a structured, two-dimensional, stencil-like computation. An easy mapping would be: take the maximum region which you can overlap between these two graphs, map that region of the object graph onto the processor graph, and then you are left with regions which are unmapped, and you make recursive calls to do the same thing again — you try to map the smaller region onto what remains. A variation is that you could rotate the graph initially so that the dimensions roughly match, but they could still be different and you have to work with whatever you have. Another possible heuristic is to start from one corner of the object graph and try to place that corner onto the processor graph: you map this one here, then the next two, which are its near neighbors, go here, and so on — although when you get towards the end you will have some leftover mess which you need to handle. Another technique, which was not developed by us — it has been used in VLSI, and this is the paper which mentions it — is the following: you take the object graph and you want to fold it onto the processor graph, which is already aligned so that the longer dimensions match, and you map one row at a time: the first row of the object graph is mapped like this, then the second row, the third row, and so on. That is how this row-by-row heuristic places the object graph onto the processor graph.
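Going back one step, the pattern-matching stage mentioned above can be fairly simple for a graph like the WRF one; here is a rough sketch (my own, not the framework's actual code) that checks whether a communication matrix is a 2D near-neighbor stencil, without wrap-around, for some factorization of the process count:

```cpp
// Detect whether each rank's set of communication partners matches a
// rows x cols near-neighbor stencil (no wrap-around links).
#include <set>
#include <vector>

bool isStencil2D(const std::vector<std::set<int>>& partners, int rows, int cols) {
  if ((int)partners.size() != rows * cols) return false;
  for (int r = 0; r < rows; ++r) {
    for (int c = 0; c < cols; ++c) {
      int rank = r * cols + c;
      std::set<int> expected;
      if (r > 0)        expected.insert(rank - cols);  // up
      if (r < rows - 1) expected.insert(rank + cols);  // down
      if (c > 0)        expected.insert(rank - 1);     // left
      if (c < cols - 1) expected.insert(rank + 1);     // right
      if (partners[rank] != expected) return false;
    }
  }
  return true;
}

// Try every factorization n = rows * cols and report the first match.
bool detectStencil2D(const std::vector<std::set<int>>& partners,
                     int& rows, int& cols) {
  int n = (int)partners.size();
  for (rows = 1; rows <= n; ++rows) {
    if (n % rows != 0) continue;
    cols = n / rows;
    if (isStencil2D(partners, rows, cols)) return true;
  }
  return false;
}
```

The real framework also has to recognize wrap-around links and 3D or 4D variants, but the idea is the same: hypothesize a shape and check the neighbor sets against it.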
Let me show what these mappings look like for the various heuristics. Remember that there are some connections which are not shown here: the actual stencil communication pattern also has vertical connections, and if I draw all of those vertical connections on the mapped graph it looks like this — what is happening is that some of these connections are being stretched. So this particular slide is showing the mapping solutions for a given object graph and a given processor graph using the various heuristics, and at runtime the framework chooses the heuristic which is best. Next I will describe how we select the best heuristic. We use the hop-bytes metric to rate mapping algorithms. Hop-bytes is the weighted sum of the message sizes, where the weight is the number of hops, or links, a message travels under the new mapping. Effectively, for each message we multiply the distance it travels by the number of bytes — the size of the message — and we sum this over all messages. If messages travel farther on the network, this term gets big and you get a larger hop-bytes value, which means there is more contention: if every message travels farther, you are creating more contention. If you can keep this value low, you get a smaller hop-bytes value, which indicates there is less contention in the network. So we choose the heuristic which has the lowest hop-bytes, indicating that we would be creating the minimum contention on the network. Previously another metric was used — and it is still used in VLSI circuit design — which is the maximum dilation: you find the maximum dilation over all edges in the graph. We think that the first one is a better metric for parallel computing; I cannot go into the details right now, but we can discuss this later. For these six heuristics, these are the hop-bytes values: for this particular case, that one looks like the best — it has the lowest hop-bytes and is chosen automatically for the mapping. It turns out that for this particular case it also gives the smallest dilation, so it is good either way.

Now, once we have this, we want to verify that the algorithms actually do well for real applications. I will be showing one of the applications we discussed earlier; we have actually done this for three different applications — MILC, which is a lattice QCD application, POP, which is the Parallel Ocean Program, and WRF for weather modeling — and I will show how we did this for WRF. This is joint work with IBM. What we do is take the application, run it on Blue Gene, and use the HPC Toolkit tools to dump the communication patterns. Then we do pattern matching, and based on that we do the mapping offline for the given communication graph and the given processor topology — the number of processors we are going to run on — and then we pass the new mapping file to the job scheduler on Blue Gene for the subsequent run.
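The hop-bytes metric just described, written out as a small function (a sketch that assumes a 3D torus with wrap-around links, which is why the hop count per dimension is the shorter of the two ways around):

```cpp
#include <algorithm>
#include <cstdlib>
#include <vector>

struct Coord { int x, y, z; };              // processor coordinates on the torus
struct Edge  { int src, dst; long bytes; }; // one entry of the communication graph

// Number of links between two processors on an X x Y x Z torus.
int torusHops(const Coord& a, const Coord& b, int X, int Y, int Z) {
  int dx = std::abs(a.x - b.x), dy = std::abs(a.y - b.y), dz = std::abs(a.z - b.z);
  return std::min(dx, X - dx) + std::min(dy, Y - dy) + std::min(dz, Z - dz);
}

// Hop-bytes of a mapping: sum over all messages of (bytes * hops travelled).
long long hopBytes(const std::vector<Edge>& graph,
                   const std::vector<Coord>& mapping,   // task id -> coordinates
                   int X, int Y, int Z) {
  long long total = 0;
  for (const Edge& e : graph)
    total += static_cast<long long>(e.bytes) *
             torusHops(mapping[e.src], mapping[e.dst], X, Y, Z);
  return total;
}
```

The framework would evaluate each candidate heuristic's mapping with something like this and keep the one with the smallest value.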
What we saw was that WRF exposes a 2D pattern, and since we are trying to map this 2D communication graph onto the 3D torus of Blue Gene/P, we need to somehow fold from 2D to 3D. There are two ways we could fold; most of the time, the one which tries to keep neighbors on the same side of the torus works better than the other technique, and again, at runtime the best heuristic gets chosen and the folding for that one is done. So we did this for WRF. This is how the results look for this application: on the X axis is the number of cores we ran WRF on, and on the Y axis is the average number of hops per byte per processor. Average hops per byte, similar to the hop-bytes metric, tells us how many links, or hops, each message travels. You can see that for the default mapping — whatever default mapping is done by the Blue Gene job scheduler on the Blue Gene/P machine — every message travels roughly two links. We were able to reduce this to less than one and a half, and as we scale up we actually get much bigger improvements: at two thousand processors every message was traveling three hops, and it now travels close to one hop. So we have brought the communication down to being nearly local in the actual physical topology, not just in the virtual application topology. It remains to be seen whether these improvements in hops lead to better performance for the application.

For that, we divide the running time of the application into the communication time and the computation time. For this case the communication time improved by a couple of percent; for this case the communication time improved by forty-five percent — we had achieved a significant reduction in the number of hops. For this case we actually see an increase in the communication time, but we still see good improvements for both of these cases: an improvement of seventeen percent in the overall application performance on one thousand cores, and of eight percent on two thousand cores. It is surprising that we get this increase in the communication time and not a decrease, and we still have to figure out why we nevertheless get a performance improvement — it might possibly be because of lower contention in the network. What is important to remember is that this is a very complex situation: each parallel application is composed of several computations, and improving performance is very complex. For example, in this particular case, on one thousand cores we were able to reduce the hops by sixty-four percent and the communication time by forty-five percent, but the overall performance improvement was only seventeen percent. This says that the application was latency tolerant in some respects, and that is why this did not translate into a forty-five percent performance improvement. So when you try to map a given application in a topology-aware fashion, you need to figure out how communication-intensive the application is, whether it is latency tolerant or latency sensitive, and then how much time the application spends in communication ultimately decides how much actual performance gain you can get from this technique.

We are also doing this for irregular graphs — so far I have covered regular graphs. [Question] OK — it is the same application, WRF, and we are running it strong-scaled on the same problem: it is a twelve-kilometer-resolution continental US dataset. These are recent runs we have been doing, and I have just started to contact the application developers, because they have more insight than I do into how the application works.
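As a back-of-the-envelope check on those numbers (my own arithmetic, not something stated in the talk): if a fraction \(f\) of the run time is communication that actually blocks progress, then cutting that communication time by forty-five percent shrinks the total time by roughly

\[ \Delta \;\approx\; 0.45\, f , \]

so the observed seventeen percent overall gain is consistent with an effective, non-overlapped communication fraction of about \(f \approx 0.17 / 0.45 \approx 0.38\); the rest of the communication is presumably hidden behind computation, which is what "latency tolerant in some respects" means here.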
As for the number of cores, I am not so worried, because when you have a small partition — for example, five hundred twelve cores is only one hundred twenty-eight nodes on the Blue Gene machine, which is a torus of eight by four by four — the effective diameter is just four plus two plus two, which is eight. So the maximum number of links a message can travel is only eight. The technique of topology-aware mapping, as I mentioned in the beginning, pays off when you have a very large machine and messages would otherwise travel very far — say fifteen or twenty hops — and you can cut that down; then you do much, much better. Typically, for most applications like NAMD and others, the performance improvements start to show up when you have a larger partition. But I could be very wrong about why we see these particular numbers — I would expect the improvement on two thousand cores to be better than what I get on one thousand cores, and right now I do not have a good answer for why this is happening.

OK, so I have also been working with irregular graphs, but I will not go into the details here. The problem is more challenging because you do not have a structure to the communication pattern and you still want to map it onto the 3D torus; the number of neighbors can be arbitrary — a given process might be communicating with ten neighbors or fifteen neighbors — and it is a harder problem than the regular one.

To summarize the research I have been doing on mapping: we have shown that contention for the same links — when the same link is used by several messages — reduces the available bandwidth; the farther messages travel, the more contention there is, and this is a situation we want to avoid. Topology-aware mapping is a technique which is important for a certain class of applications. Sometimes application developers do not realize this; sometimes it is because machine vendors advertise that mapping is not important because they have good bandwidth — but we have shown that we can get good improvements even on machines with fast interconnects. Finally, we are trying to automate this process so that the application developer does not have to worry about it. For some applications, like OpenAtom, the improvements are as high as fifty percent, and we expect this to be true for at least a certain class of applications which are communication intensive — ones that do transposes, matrix operations, and so on. Do I still have five minutes? OK.

So I will just discuss some future work and my plans for maybe the next few years, starting with some extensions to my thesis work. The current assumption is that the entire communication graph can be collected on a single processor. This is going to become difficult as we run on very large machines, both in terms of memory — you need to store the communication graph — and in terms of the time it takes for the communication graph to be collected on one processor. So we want to do something we refer to as hierarchical mapping: a hybrid between centralized and completely distributed. If you do a completely distributed mapping, where every task decides where it wants to go, it does not have enough information and the resulting mapping solution is not optimal. So we want to do a hybrid,
where we form groups of processors, the mapping is done within those groups, and there is a top-level mapping across the groups which does minimal movement and small refinement changes. This can overcome the memory bottleneck and it can also reduce the time it takes to do the actual mapping. Finally, we would also like to extend this to other interconnect topologies that might show up in the future; even within a given node there might be a topology to how the different cores are connected and which link goes out to the network, and all of this needs to be considered when we are doing the mapping.

Looking further ahead, I want to focus on research in the next few years that has real impact on scientific applications, and the two broad directions I want to focus on are communication optimization algorithms and techniques, and load balancing. A good example: the ExaScale Software Study done by DARPA says that as we go to very large machines we will need the runtime system to do static task placement and runtime migration of tasks. Both of these are things we have been doing for quite some time now in Charm++ — you can actually do runtime migration of tasks — and we would want to map the communication topology onto the underlying network topology in a good way; in either case the runtime system is going to be important. The study also says that topological hints exist in MPI but are rarely, if ever, used, and that most languages only express locality of computation — they do not have information about which tasks access which portions of which arrays, and so on. So I am planning to work with MPI developers towards better implementations of the MPI Cart functions, so that you can use the topology information available underneath and do better mappings; the algorithms we develop can be directly translated, so it is more a matter of deploying the same techniques in other programming paradigms. We also want to gather support for topology-aware job schedulers, because on machines like the Cray, if you do not have a topology-aware job scheduler, then however good your mapping is, if you are not allocated a contiguous piece of the machine you will not get the performance improvements — different jobs will still interfere with each other across partitions. This would also require extending the work to other topologies. Communication in general is going to become a big problem: from the current trends, we are not increasing link bandwidth on the networks as fast as we are increasing the speed of the processors, so you are generating more floating-point operations than the bytes you can deliver on the network, and that is going to become a problem. Certainly better implementations of MPI functions, and in particular implementations of MPI collectives, are some other things I want to work on.

The other direction is load balancing. As we move towards exascale, we are going to have radically new scientific applications, such as multi-physics applications where different phases do different things, and you want to load balance all of these phases according to their communication patterns and computational loads.
There is heterogeneity in processing elements already, but you might also have disparities within the node — a single link might be going out to the network and all of the cores on the node might have to go through it — so computation is going to be cheap but communication is going to be expensive, and load balancers need to be aware of all of these things. We need to be aware of heterogeneous processing cores, we need to be aware of temporal changes in load for applications with different phases, and load balancers should also be aware of the communication graph of the application. So I want to work in the area of developing tools and techniques for doing this for MPI and other paradigms which might become important for scientific applications, and this would mean we need runtime instrumentation and runtime task migration for some of these. I think that brings me to the end of my talk. I can take any questions. Thank you.

[Question] Yes — some of it is relevant, in the sense that it would not be the same issues, because you do not have the problem of hops, but you still have problems like which core is accessing which part of the memory, and you want to map things depending on that. So depending on the access patterns to memory, you might want to map things according to that. The issues are different, but the techniques might be applicable in some of these cases. Yes — some of these things would be applicable, but I do not think the technique would work as-is for a different kind of network; depending on the issues, we might have to develop different metrics for evaluating these things. That becomes more important when you have an InfiniBand network, where you have a certain number of ports through which you have to go into the network; for the torus networks I do not think that is an issue.

[Question] We have not, but we are looking at that, because for irregular communication patterns we want to do something like ParMETIS, so that we can do an initial graph partitioning. That minimizes the inter-partition communication volume — it reduces the number of bytes which go over the network — but it does not minimize the number of hops the messages travel. The problems are related, but they are still separate. So that is like a preconditioner for my algorithm, in the sense that I can use it to partition my graph and then still use the topology-aware techniques to minimize the number of hops between communicating partitions — a technique several of the older papers also used. No, we have not done a comparison with such schemes yet. Thank you, thank you.