[Host:] ...a long list of contributions to supercomputing research; he is a recipient of the George Michael HPC Fellowship. So it is really a pleasure and an honor — please join me in welcoming him.

[Speaker:] Thank you. Is this audible? OK. So I am a doctoral candidate at the University of Illinois. I work with Professor Laxmikant Kale, who typically goes by Sanjay, and my doctoral thesis is focused on automating topology-aware mapping for large parallel machines. Before I jump into my dissertation, I will give a brief motivation for why we work with HPC and what kinds of applications I work with, and then I will try to frame my specific work in the area.

My interests are in performance analysis and tuning of parallel applications. Within performance analysis I focus on communication optimizations, which tend to boil down to mapping of applications, and on load balancing of applications — NAMD and OpenAtom being two of them. They are both written in Charm++, which is a C++-based parallel programming model developed at Illinois in my advisor's lab, and the constant goal we are trying to achieve is better performance, to help scientists get their simulations done as fast as we can.

Just a few examples of applications that use large allocations on supercomputers. WRF is the Weather Research and Forecasting model; it is a collaboration between NCAR and a few other places. In this figure you are seeing a forecast for March twenty-third of the precipitation over the continental US, and this one shows the pressure values all over the US. I will just describe the applications first and then try to motivate what I am talking about. NAMD is another application, which does molecular dynamics; here we are seeing one of the largest simulations ever done — the binding of ions to a channel, a 2.7-million-atom simulation. This is a collaboration between my advisor, who heads the Parallel Programming Laboratory, and Professor Klaus Schulten of the Theoretical and Computational Biophysics group at Illinois. Another application is FLASH, which is used for astrophysical and cosmological simulations; this is one of the first fully three-dimensional simulations of a type Ia supernova explosion, and it is a collaboration between Argonne, the University of Chicago, and a few other places.

By giving examples of these applications, what I am trying to say is that if you tried to run these simulations on a single machine it would take years, maybe more than the lifetime of a human. These simulations typically take a very long time, and we are trying to use supercomputing power to get them done in time ranges which are meaningful — for example, for weather forecasting you want to finish within twenty-four hours so you can forecast the next day's weather. Hence there is a lot of work going on in the HPC area. This is a graph from the Top500 list, which is published every six months and lists the fastest five hundred supercomputers in the world. The bottom line is the number-five-hundred machine and the top line is the number-one machine, in terms of the computing power available on these machines. We crossed the petascale barrier last year and we are moving towards exascale, so there is lots of interesting work going on and lots of challenges we face as we scale to large machines. Let us look at the sizes of these machines.
We have five supercomputers which have more than a hundred thousand processors each, and a very large number of processors in total in this range. The work I am going to present today is most relevant for machines at this very large scale. It can be made to work for smaller machines, but it remains to be seen whether it gives similar performance improvements on smaller machines as the ones I will present today for these large supercomputers.

So this was just a motivation for HPC. I will now get into the motivation and introduction for the work I have been doing. After that I will move on to discussing contention on supercomputers and topology-aware mapping, and then I will discuss the automatic mapping framework we are trying to develop, which relieves the application developer from doing the mapping themselves — everything is done in the runtime automatically, so the application developer can get on with the science and users can get improvements from this mapping for free.

OpenAtom is a highly parallel application; it runs on a large number of cores. This graph is from back in two thousand six, when we were trying to run it on the Blue Gene/L machine at IBM T. J. Watson. On the X axis is the number of cores we were running on; on the Y axis is the time per iteration, or time per step, in seconds. The simulation is of thirty-two water molecules — a very small system, not expected to scale very well — but still, what we are seeing here is that from two thousand processors onward the scaling is not very good. Ideally you would expect the time to keep going down, as it does from five hundred twelve to one thousand twenty-four processors. So we did some performance analysis of why this is happening. This is a view from the Projections performance analysis tool, which is a part of Charm++; Charm++ has automatic instrumentation of parallel applications, and you can visualize the logs and see the various things happening within the application. This particular visualization is called the timeline view: on the X axis is time, and on the Y axis we have a few of the processors — this was a one-thousand-processor run, and a few randomly chosen processors are shown here. Each of the different colors is one particular function being executed on a given processor. What we see in this red box is that there is a lot of white, which corresponds to idle time — the processor is doing nothing, just waiting for something to happen — and the analysis showed that most of the time is being spent waiting on messages; at this point all messages are received and all processors start executing again. Now, what we noticed — and you can see this is around nine seconds — was that messages of the sizes being sent in this application should not be taking such a long time, and we attributed this to contention on the network. Once we did topology-aware mapping to avoid contention on the network, we were able to get much better performance, and I will show the improved plots after we discuss what we did to solve the problem.

So consider a one-dimensional mesh — processors placed on one line. Most supercomputers today use wormhole routing for sending messages on the network. What that means is that a given message is broken down into smaller flits: you send a header flit out, and the header flit decides the route of the message; all of the remaining flits in the message just follow the header flit. Based on that model, you can model the message latency as the sum of two terms. The first term is the latency for the header flit to reach the destination — you have a dependence on the distance there, because at each router you pass through you incur some latency. The second term is the bandwidth term, where L, the total size of the message, is divided by the bandwidth available on the link. That gives the total time for the message to reach the destination if you assume a wormhole routing model.
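Writing that model down as an equation (a standard wormhole-routing approximation; the symbols here are mine):

\[ T_{\text{msg}} \;\approx\; D \cdot t_{\text{hop}} \;+\; \frac{L}{B}, \]

where \(D\) is the number of links (hops) the message traverses, \(t_{\text{hop}}\) is the per-hop latency seen by the header flit, \(L\) is the message size in bytes, and \(B\) is the link bandwidth. The contention scenario discussed below then amounts to saying that if \(k\) messages share the most loaded link on the path, each one effectively sees bandwidth \(B/k\), so the second term grows to roughly \(kL/B\) and can no longer be treated as independent of placement.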
Now, typically people have assumed that since the first term — the length of the header flit times the distance the message is traveling — is small, you can neglect it, and the time for a message is mostly the second term. But this only works well if you have a communication pattern where messages are not sharing network links: processors communicating in pairs, and everything works fine. In a scenario where different messages are sharing links between them, you actually have contention for some of the middle links — specifically, for this particular link here you have three messages trying to use it, so the bandwidth available to each message on this link is reduced to one third and every message is delayed. When you are in such a situation you cannot use that equation to simply model message latencies. And again, this is the problem we want to avoid. What we want to do is place this object and this object on processors close to each other, so that messages do not travel far in the network; that leads to less contention, which leads to better messaging times and to better overall performance.

I described this in terms of a simple one-dimensional mesh, but almost all of the largest supercomputers today have some interconnect topology. Some of them are three-dimensional meshes or tori: the Cray machines — XT3, XT4 and XT5 — have a 3D torus if you consider the whole machine; if you get a smaller partition it is typically a 3D mesh. The IBM machines, BG/L and BG/P, are 3D tori. You also have some of the biggest machines as InfiniBand clusters — Ranger, for example, is an InfiniBand machine, and Roadrunner at Los Alamos, which is at the top of the diagram there, is again an InfiniBand machine. IBM also has its proprietary Federation interconnect. And in the future you might have even more radical topologies, such as the new BG/Q machine or the Blue Waters machine. What we want to do is to be able to exploit these topologies for better performance.

So that is the topology of the machine. Most of the time, applications also have a communication topology, or a communication graph. For example WRF, which was the first example I showed, has a simple two-dimensional communication pattern: it is a stencil computation, so every process talks to four neighbors, one in each of the two directions. NAMD again has a specific communication pattern, and I won't go into the details. FLASH, which is an unstructured computation, has an irregular pattern.
So again, each application has a certain communication graph which we can exploit and map onto the processor topology. That brings us to the mapping problem. To avoid contention, what we are trying to do is this: given the interconnect topology of the supercomputer and the communication topology of the application, we want to map the tasks — the parallel entities in the application — to the physical processors to optimize communication. Our first order of business is to balance the computational load, because if you do not have load balance you will not get good running times; you do not want a certain processor to spend too much time doing work while others are waiting. So the first-order concern is load balance, and the second-order concern is to minimize communication traffic on the network by co-locating communicating objects on nearby processors. We hope that by minimizing contention we can get better performance.

Before I go into the actual techniques, some related work and how my work differs from previous work. In the eighties there was a lot of work on mapping object graphs onto processor graphs. Most of these were theoretical studies; the machines were small — for example one hundred twenty-eight processors — so the number of hops, or links, each message traveled was not large. These techniques were slow and offline, because they did not care about time complexity: with one hundred twenty-eight processors, even if the algorithm was order n cubed, it did not take a long time. In the nineties, virtual cut-through and wormhole routing were introduced, which brought down the messaging time because the initial dependence on hops was reduced, and people thought that mapping was no longer important. With the emergence of machines like BG/L and BG/P, IBM started saying that topology is becoming important and application developers should pay attention to it, and so some recent work has been done by people at IBM and by independent application developers.

What I am trying to do here is, first, to re-establish that mapping is important — we have done various contention benchmarks to prove this — and to show that it also matters for machines with much faster interconnects with very high bandwidth. Then we try to prove that the work is important by using real scientific applications, running them on supercomputers with the algorithms we develop, and seeing whether there are actual performance improvements. Finally, we are trying to develop scalable and fast runtime solutions, because machines are becoming larger and larger and we want to do things at runtime, as the simulation proceeds, so we do not want to take a lot of time doing the load balancing. And we are trying to make everything application-independent: we are developing an automatic mapping framework which can take the communication graph of an application and build the mapping, so the application can just use the mapping solutions to get better performance.
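Stated a bit more formally (my notation, as a sketch of the optimization problem just described): among all assignments \(M\) of tasks to processors that keep the computational load balanced, we look for one that minimizes the total traffic,

\[ \min_{M} \; \sum_{(u,v)} b(u,v) \cdot d\big(M(u), M(v)\big), \]

where \(b(u,v)\) is the number of bytes exchanged between tasks \(u\) and \(v\), and \(d\) is the number of network links between the processors they are placed on. This is the same quantity that reappears later in the talk as the hop-bytes metric used to compare mapping heuristics.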
A small comment about the scope of this work. We are currently focused on three-dimensional torus machines; we might also look at InfiniBand networks and other networks, but we have not done that so far. Still, these machines form a significant percentage of the Top500 list in terms of the compute power they provide — maybe not in terms of the number of entries in that list. As for applications: there are various applications in parallel computing, and we can broadly divide them into computation-bound and communication-bound; communication-bound applications are the ones which will be most influenced by these techniques. Even within communication-bound applications you can divide them into two categories. Latency-tolerant applications are the ones where, if a processor is waiting for a message, it can do some other work. NAMD is an example of a latency-tolerant application: it does virtualization, so if a given object on a processor is waiting for a message, another object on the same processor continues to do some work. Those applications might not get that much improvement. Latency-sensitive applications, which actually wait for messages before they can proceed further, are the ones which stand to benefit the most from topology-aware mapping. Any questions so far?

OK, so I will move on to the first part of my talk, which is on contention and topology-aware mapping. We are trying to prove the claim that contention does affect message latencies, so we do two simple benchmarks: one of them creates a no-contention scenario and the other one creates contention on the network. The first one is a one-dimensional simplification of the 3D torus: one particular rank in the allocated job partition — the master processor — sends messages to all of the other processors in the job partition, but serially, one by one. It just sends a message to this processor, then after that to the next processor, and so on. So you do not create any contention in the network, and you record the message latencies for these message sends. What we expect is that, since there is no contention on the network, there should not be a significant dependence on hops — the number of links messages travel. Let us see the results, and I can talk more about this. This is how the plot looks: on the X axis is the message size — we do this for different message sizes — and on the Y axis is the time it takes for a ping-pong between the chosen master processor and the other processors. There are multiple circles here for each size, which are the times it takes to send messages to different processors on the network. You see a spread here for the smaller messages, and it reduces to essentially one circle for larger messages. This is happening because we have a very small per-hop routing latency. The red line shows the difference between the minimum and the maximum: the difference is about thirty percent on the lower end and goes down to around five percent there. The difference for the smaller messages is coming from the first term — there is a dependence on hops for small messages because the first term is not negligible — whereas for very long messages the first term is negligible and you just see the dependence on the size of the message.
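A minimal sketch of that no-contention benchmark (my own reconstruction, not the actual benchmark code; rank 0 stands in for the master rank, and the hop distance to each partner — which the real study records from the torus coordinates — is left out here):

```cpp
// ping_nocontention.cpp -- rank 0 ping-pongs with every other rank in turn,
// one partner at a time, so no two messages are ever in flight together.
#include <mpi.h>
#include <cstdio>
#include <vector>

int main(int argc, char** argv) {
  MPI_Init(&argc, &argv);
  int rank, nprocs;
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);
  MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

  const int num_iters = 100;
  for (int msg_size = 4; msg_size <= (1 << 20); msg_size *= 2) {
    std::vector<char> buf(msg_size);
    for (int partner = 1; partner < nprocs; ++partner) {
      MPI_Barrier(MPI_COMM_WORLD);          // keep all other ranks quiet
      if (rank == 0) {
        double start = MPI_Wtime();
        for (int i = 0; i < num_iters; ++i) {
          MPI_Send(buf.data(), msg_size, MPI_CHAR, partner, 0, MPI_COMM_WORLD);
          MPI_Recv(buf.data(), msg_size, MPI_CHAR, partner, 0, MPI_COMM_WORLD,
                   MPI_STATUS_IGNORE);
        }
        double one_way_us = 1e6 * (MPI_Wtime() - start) / (2.0 * num_iters);
        std::printf("size %d  partner %d  latency %.2f us\n",
                    msg_size, partner, one_way_us);
      } else if (rank == partner) {
        for (int i = 0; i < num_iters; ++i) {
          MPI_Recv(buf.data(), msg_size, MPI_CHAR, 0, 0, MPI_COMM_WORLD,
                   MPI_STATUS_IGNORE);
          MPI_Send(buf.data(), msg_size, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
        }
      }
    }
  }
  MPI_Finalize();
  return 0;
}
```

Plotting the measured latency against the hop distance of each partner gives the spread described above for small messages and the collapse to a single value for large ones.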
Now, compared to these Blue Gene/L results, when we go to the Cray machine — and again, this is a plot on the Cray machine, which has a much higher bandwidth — we see similar results: for small messages there is a small dependence on the number of links traveled, but for larger messages there is no dependence. So these results look similar for the two machines.

Then we do a contention benchmark. In this benchmark, processors are paired so that in the first case every processor talks to another processor which is one hop away; in the next run, every processor talks to a processor which is two hops away; and so on. We do this from one hop up to N hops, the maximum number of links, or the diameter of the network. What you can see is that for three hops, for example, a given link is being shared — this link is shared by this message, this message and this message — so we expect the bandwidth to be shared and hence the message latencies to increase. This is how the plot looks, again with message sizes on the X axis. What you see is that for smaller messages there is no dependence on the number of hops the message is traveling. When we go to large messages, the red line is the one-hop case, which is the baseline where there is no contention — everyone is talking to someone one hop away — and as you increase the number of links each message traverses, you see that the time increases significantly. This is on a log scale, so the difference between the bottom-most line and the top-most one, at eight hops, is actually sixteen times. What this is saying is that for small messages there is no contention, because the number of packets is small, but for large messages, when you start sending messages far away on the network, as you increase the number of hops you see a significant dependence on both the distance and the contention. This is the problem we want to avoid: we do not want messages to be sent far away on the network in a real parallel application.

These are the results on the XT3 machine at Pittsburgh. It is a much faster machine in terms of the interconnect: the link bandwidth on the Blue Gene/L machine is around one hundred seventy-five megabytes per second, while on this one it is around 3.8 gigabytes per second. So there is a smaller dependence, but it is still about two times between the one-hop case and the farthest case. So we expect that the techniques we develop will still be useful on machines like the Cray. Also, machines typically have a much lower effective bandwidth than the peak advertised bandwidth, and for Cray machines it is much less — around two gigabytes per second compared to the 3.8 that is advertised — whereas Blue Gene/L stays within ten percent of its peak advertised bandwidth. So on Cray machines the bandwidth you actually get is much lower than what is advertised.

So now I will move on to a case study of OpenAtom: how we did topology-aware mapping there and the performance improvements we obtained. We already saw this plot — we have a kink there in the graph — and we were trying to see why the performance problems are there. We found out that contention on the network was a problem. We also knew this because this application was developed in-house, and we knew that it is a very communication-intensive application: it does a lot of FFT transposes, which lead to all-to-all, many-to-many communication.
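Stepping back to the pairwise contention benchmark described a moment ago, here is a rough sketch of the idea (my reconstruction, not the actual benchmark; it pairs rank r with rank r + distance in rank order, which corresponds to a fixed hop distance only if ranks are laid out linearly along the torus, so treat that as an assumption):

```cpp
// contention_pairs.cpp -- all ranks exchange large messages with a partner
// 'dist' ranks away at the same time, so messages share network links.
#include <mpi.h>
#include <cstdio>
#include <vector>

int main(int argc, char** argv) {
  MPI_Init(&argc, &argv);
  int rank, nprocs;
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);
  MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

  const int num_iters = 50;
  const int msg_size = 1 << 20;       // 1 MB: the regime where contention shows
  std::vector<char> sbuf(msg_size), rbuf(msg_size);

  for (int dist = 1; dist < nprocs; ++dist) {
    // Pair the lower half of each block of 2*dist ranks with the upper half.
    bool lower = ((rank / dist) % 2) == 0;
    int partner = lower ? rank + dist : rank - dist;
    MPI_Barrier(MPI_COMM_WORLD);
    double start = MPI_Wtime();
    if (partner >= 0 && partner < nprocs) {
      for (int i = 0; i < num_iters; ++i)
        MPI_Sendrecv(sbuf.data(), msg_size, MPI_CHAR, partner, 0,
                     rbuf.data(), msg_size, MPI_CHAR, partner, 0,
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    }
    double per_iter_ms = 1e3 * (MPI_Wtime() - start) / num_iters;
    if (rank == 0)
      std::printf("distance %d: %.3f ms per exchange\n", dist, per_iter_ms);
  }
  MPI_Finalize();
  return 0;
}
```

As the distance grows, more and more exchanges overlap on the same links, which is the effect behind the sixteen-fold slowdown seen on the Blue Gene/L plot.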
Just a small introduction to OpenAtom: it is an ab initio molecular dynamics code. Communication in this application is static — it does not change with time. If you map objects onto certain processors, you will get the same communication pattern as the iterations evolve. It is regular in the sense that you can draw nice graphs in 2D or 3D of the communication patterns. The only challenge here is that we have multiple groups of objects: because this is written in Charm++, you can have multiple arrays of objects which are mapped onto the machine, unlike MPI where you just have one process per physical processor. So there are multiple groups of objects, they have conflicting communication patterns, and that is the challenge. As for the way this application is parallelized — I will not go into the details, but these three arrays are the most important in terms of the communication requirements. G-space and real-space communicate with each other plane-wise, so this particular row of G-space communicates with this particular row of real-space. G-space and the PairCalculator communicate state-wise in one direction and plane-wise in the other: this particular column of G-space communicates with the first plane here of the PairCalculator, and so on. So you see there is a conflicting pattern, in the sense that we want to map this group and this group along one dimension, and this group and this group along the other dimension, and we want to co-locate all of these together so that the communication is minimized. This work is joint work with Dr. Glenn Martyna at IBM, Dr. Mark Tuckerman at NYU, and Eric Bohm from my lab.

What we did was a topology-aware mapping of the chares, which are the objects in Charm++. The G-space objects are first mapped onto the 3D torus partition. Once we have done that, we try to map the communicating states of real-space close to the communicating states of G-space — if G-space is mapped here, we try to place real-space here. What that does is restrict the G-space and real-space objects to the same area, the same region, of the torus partition. Then, depending on how G-space is mapped — for example this green plane — we try to map this PairCalculator plane close to those objects. These are the three important communication patterns; there are several other communication patterns, and the other objects in OpenAtom are also mapped according to those. Using this mapping we were able to bring down the overall time per step of the application significantly. The old one took around 8.5 seconds; the new one takes 5.2 seconds, and you can see that the time spent in this phase is reduced significantly. We were also able to reduce the time for this phase, which might just be because of better load balance in this case. It is also important to remember that the default mapping was not unintelligent — it was a very well load-balanced mapping; it was just not communication-aware, not topology-aware, so it was not trying to place communicating objects physically close. Just doing the topology-aware mapping brings us this much better performance, a large reduction in the time per step.

[Question] Yes — this is just a graph cut at a point in time where we have certain boundaries, so this might still be spill-over from the last phase, the last part of the time step appearing here. I do not have a good answer for that; there might be several possible reasons. It might be that, since the mapping has been done differently, there are other things which can lead to this.
But overall we are able to reduce the phases which used to take the most time, and hence we get a much better improvement. [Question] You are asking what is the optimal time the application could achieve — I do not know if I have a good estimate of that. The best would be if we could shrink all of this here — although you would still have all the white space. It might be that this block here is a reduction; it keeps reducing across processors as you go. And you still have the black here spread all over. So if we could go further we might get a much better improvement. The best possible would be if we could get rid of this white space here, and here.

What this does is save time: a simulation which would have taken six months now takes only three months — we halved the time on eight thousand processors. So it saves allocation time on the supercomputer, and it also gets scientists their results faster, so it is better both ways. This is how the scaling performance looks for this application — that was on eight thousand processors. The red line is the new plot with the topology-aware mapping. Since this is time per iteration, lower is better, so we are able to get better timings: we got rid of this artifact here and the time is much better, and as we scale up we keep scaling much better. This was for a small system of thirty-two water molecules; let us see what it looks like for a larger system, which is two hundred fifty-six water molecules, again on the same Blue Gene/L machine. We still get good performance — not quite as good, but around here we are actually taking half the time we were taking before the mapping. Both of these systems are benchmarks used by computer scientists to benchmark this application. We also used a real system which was being used by scientists, and we saw similar performance improvements there as well — you can still see good improvements between the green line and the purple line.

We also wanted to make sure that this works well on other machines, so these are results on the Cray XT3 machine. Now, the Cray XT3 has a 3D torus, but the job scheduler there is not topology-aware, so if you request nodes for a job you might get nodes all over the network. These runs were done through a system reservation where we could allocate 3D contiguous partitions, and then we did our runs to see if we get improvements. As you can see, this is the default line, and we are getting similar improvements, just like before. It is important to remember that the Cray has twenty-one times more bandwidth than BG/L, at least in peak advertised values, and still communication is a problem and we are still able to get good improvements. This is for the other system, and we see good improvements here too.

OK, so this was a case study of OpenAtom. We also did this work for NAMD, but since I have limited time I will not go into that. Once we had done the mapping for these two applications, we realized — we were the application developers here — that application developers really should not have to do this inside the application; the runtime should be able to do the mapping automatically, so that the application can get this for free. So the major part of my thesis is an automatic mapping framework,
which we are developing for both Charm++ and MPI applications, and in the future for other parallel programming models. What we want to do is obtain the communication graph, do some pattern matching to find out whether the communication graph is regular or irregular, and, depending on that, choose among the mapping algorithms we have developed. I will not get into the irregular graphs in this talk, but I will talk about how we obtain the processor topology graph and the communication graph, and I will give some examples of how we do the mapping for certain applications. So the two inputs we need for the framework are the processor topology graph and the communication graph of the application. We then do pattern matching to identify whether there are 2D, 3D or 4D near-neighbor communication patterns. If there are such regular patterns, we use specialized heuristics; if it is an irregular pattern, we use a different set of more general heuristics.

The first thing is to obtain the processor topology graph. The application, or the runtime, needs information such as the dimensions of the allocated partition and the mapping of ranks to physical coordinates — which MPI process is on which physical processor, and how it could be moved around. So we have developed a Topology Manager, which is a uniform API we make available on the IBM and Cray machines. On BG/L and BG/P we provide a wrapper around system calls which are already available. Cray advertises that you do not need to do topology-aware mapping — they have a fast interconnect — so there are no system calls to obtain the topology information while a job is running; there we had to dig this information out of lower-level system sources and build it up ourselves. But this interface makes things independent of the machine: whether you are running on an IBM Blue Gene machine or a Cray XT machine, as long as you are on a 3D torus machine you can get this information about the dimensions of your job partition and then do the mapping of your application.

[Question] So, most MPI applications do not use those calls. I guess what you are asking is: if you use the MPI Cart options, does the implementation have the underlying topology information, and does it do a good mapping? The only MPI implementation I know of that does is the IBM Blue Gene/P one, which actually tries to do something when you specify your topology for the application; but on, say, an InfiniBand machine these calls do not translate into a good mapping underneath — they have no information whatsoever about the underlying machine topology. That is one of the things we are trying to do: we are trying to push these ideas into MPI, and we are trying to see if we can make the MPI Cart implementation topology-aware in other MPI implementations as well.
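For reference, this is what the MPI Cartesian topology interface looks like from the application side (a minimal sketch; as just discussed, whether the reorder flag actually produces a topology-aware placement depends entirely on the MPI implementation):

```cpp
// cart_example.cpp -- declare a 2D periodic process grid to MPI and let the
// implementation (optionally) reorder ranks to match the physical topology.
#include <mpi.h>
#include <cstdio>

int main(int argc, char** argv) {
  MPI_Init(&argc, &argv);
  int nprocs, rank;
  MPI_Comm_size(MPI_COMM_WORLD, &nprocs);
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);

  int dims[2] = {0, 0};
  MPI_Dims_create(nprocs, 2, dims);   // pick a near-square 2D decomposition
  int periods[2] = {1, 1};            // wrap-around in both dimensions
  int reorder = 1;                    // permission to reorder ranks (a hint)

  MPI_Comm cart;
  MPI_Cart_create(MPI_COMM_WORLD, 2, dims, periods, reorder, &cart);

  int newrank, coords[2];
  MPI_Comm_rank(cart, &newrank);
  MPI_Cart_coords(cart, newrank, 2, coords);

  int left, right, down, up;          // neighbors for a 4-point stencil
  MPI_Cart_shift(cart, 0, 1, &left, &right);
  MPI_Cart_shift(cart, 1, 1, &down, &up);

  std::printf("world rank %d -> cart rank %d at (%d,%d)\n",
              rank, newrank, coords[0], coords[1]);

  MPI_Comm_free(&cart);
  MPI_Finalize();
  return 0;
}
```

On most clusters the reorder hint is simply ignored, which is exactly the gap being pointed out here.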
The next thing we need is the object communication graph. There are two ways to obtain it: if you are the application developer, you could obtain it manually, since you know the application's communication patterns; but we want to do it automatically. For MPI applications we use profiling — we use IBM's HPC Toolkit to get the communication matrix of the application: which MPI process talks to which other MPI process, and how many bytes are communicated. For Charm++ applications, the runtime has an instrumentation framework which can give you all this information at runtime.

Once we have the graph, we want to visualize it and do pattern matching. For example, looking at the Weather Research and Forecasting model: this is the communication matrix of the application on thirty-two processors of BG/P. You have MPI processes zero to thirty-one on one axis, and for each process you see which other MPI processes it talks to. This is a very regular application, so the number of bytes communicated is the same everywhere and the colors of the squares are similar; if different amounts of bytes were being communicated, you would see different colors. Now, given this information, if you are trying to do automatic mapping you want to know whether this is a regular pattern, so we use pattern matching to find out. It turns out that in this case it is a simple 2D communication: an eight-by-four communication graph where each MPI process communicates with four neighbors, one in each direction. For some applications there is wrap-around; for some there is not. Once we have the communication pattern, we want to compute a mapping and provide the mapping solution to the application, so that the next time it runs it can pass the mapping file to the job scheduler and change the ordering of ranks on the physical processors.

Next I will discuss some algorithms we developed for structured 2D communication patterns. We have developed a suite of heuristic techniques to do this mapping; I will describe two of them, but I will show results from several. One possible technique is this: on the left is the object graph, on the right is the processor graph, and you are trying to map the object graph onto the processor graph. Again, the communication pattern assumed here is a structured, two-dimensional, stencil-like computation. An easy mapping would be: take the maximum region which you can overlap between these two graphs, map that region of the object graph onto the processor graph, and then you are left with regions which are unmapped, and you make recursive calls to do the same thing again — you try to map the smaller region onto what remains. A variation is that you could rotate the graph initially so that the dimensions roughly match, but they could still be different and you have to work with whatever you have. Another possible heuristic is to start from one corner of the object graph and try to place that corner onto the processor graph: you map this one here, then the next two, which are its near neighbors, go here, and so on — although when you get towards the end you will have some leftover mess which you need to handle. Another technique, which was not developed by us — it has been used in VLSI, and this is the paper which mentions it — is the following: you take the object graph and you want to fold it onto the processor graph, which is already aligned so that the longer dimensions match, and you map one row at a time: the first row of the object graph is mapped like this, then the second row, the third row, and so on. That is how this row-by-row heuristic places the object graph onto the processor graph.
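Going back one step, the pattern-matching stage mentioned above can be fairly simple for a graph like the WRF one; here is a rough sketch (my own, not the framework's actual code) that checks whether a communication matrix is a 2D near-neighbor stencil, without wrap-around, for some factorization of the process count:

```cpp
// Detect whether each rank's set of communication partners matches a
// rows x cols near-neighbor stencil (no wrap-around links).
#include <set>
#include <vector>

bool isStencil2D(const std::vector<std::set<int>>& partners, int rows, int cols) {
  if ((int)partners.size() != rows * cols) return false;
  for (int r = 0; r < rows; ++r) {
    for (int c = 0; c < cols; ++c) {
      int rank = r * cols + c;
      std::set<int> expected;
      if (r > 0)        expected.insert(rank - cols);  // up
      if (r < rows - 1) expected.insert(rank + cols);  // down
      if (c > 0)        expected.insert(rank - 1);     // left
      if (c < cols - 1) expected.insert(rank + 1);     // right
      if (partners[rank] != expected) return false;
    }
  }
  return true;
}

// Try every factorization n = rows * cols and report the first match.
bool detectStencil2D(const std::vector<std::set<int>>& partners,
                     int& rows, int& cols) {
  int n = (int)partners.size();
  for (rows = 1; rows <= n; ++rows) {
    if (n % rows != 0) continue;
    cols = n / rows;
    if (isStencil2D(partners, rows, cols)) return true;
  }
  return false;
}
```

The real framework also has to recognize wrap-around links and 3D or 4D variants, but the idea is the same: hypothesize a shape and check the neighbor sets against it.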
Let me show what these mappings look like for the various heuristics. Remember that there are some connections which are not shown here: the actual stencil communication pattern also has vertical connections, and if I draw all of those vertical connections on the mapped graph it looks like this — what is happening is that some of these connections are being stretched. So this particular slide is showing the mapping solutions for a given object graph and a given processor graph using the various heuristics, and at runtime the framework chooses the heuristic which is best. Next I will describe how we select the best heuristic. We use the hop-bytes metric to rate mapping algorithms. Hop-bytes is the weighted sum of the message sizes, where the weight is the number of hops, or links, a message travels under the new mapping. Effectively, for each message we multiply the distance it travels by the number of bytes — the size of the message — and we sum this over all messages. If messages travel farther on the network, this term gets big and you get a larger hop-bytes value, which means there is more contention: if every message travels farther, you are creating more contention. If you can keep this value low, you get a smaller hop-bytes value, which indicates there is less contention in the network. So we choose the heuristic which has the lowest hop-bytes, indicating that we would be creating the minimum contention on the network. Previously another metric was used — and it is still used in VLSI circuit design — which is the maximum dilation: you find the maximum dilation over all edges in the graph. We think that the first one is a better metric for parallel computing; I cannot go into the details right now, but we can discuss this later. For these six heuristics, these are the hop-bytes values: for this particular case, that one looks like the best — it has the lowest hop-bytes and is chosen automatically for the mapping. It turns out that for this particular case it also gives the smallest dilation, so it is good either way.

Now, once we have this, we want to verify that the algorithms actually do well for real applications. I will be showing one of the applications we discussed earlier; we have actually done this for three different applications — MILC, which is a lattice QCD application, POP, which is the Parallel Ocean Program, and WRF for weather modeling — and I will show how we did this for WRF. This is joint work with IBM. What we do is take the application, run it on Blue Gene, and use the HPC Toolkit tools to dump the communication patterns. Then we do pattern matching, and based on that we do the mapping offline for the given communication graph and the given processor topology — the number of processors we are going to run on — and then we pass the new mapping file to the job scheduler on Blue Gene for the subsequent run.
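The hop-bytes metric just described, written out as a small function (a sketch that assumes a 3D torus with wrap-around links, which is why the hop count per dimension is the shorter of the two ways around):

```cpp
#include <algorithm>
#include <cstdlib>
#include <vector>

struct Coord { int x, y, z; };              // processor coordinates on the torus
struct Edge  { int src, dst; long bytes; }; // one entry of the communication graph

// Number of links between two processors on an X x Y x Z torus.
int torusHops(const Coord& a, const Coord& b, int X, int Y, int Z) {
  int dx = std::abs(a.x - b.x), dy = std::abs(a.y - b.y), dz = std::abs(a.z - b.z);
  return std::min(dx, X - dx) + std::min(dy, Y - dy) + std::min(dz, Z - dz);
}

// Hop-bytes of a mapping: sum over all messages of (bytes * hops travelled).
long long hopBytes(const std::vector<Edge>& graph,
                   const std::vector<Coord>& mapping,   // task id -> coordinates
                   int X, int Y, int Z) {
  long long total = 0;
  for (const Edge& e : graph)
    total += static_cast<long long>(e.bytes) *
             torusHops(mapping[e.src], mapping[e.dst], X, Y, Z);
  return total;
}
```

The framework would evaluate each candidate heuristic's mapping with something like this and keep the one with the smallest value.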
What we saw was that WRF exposes a 2D pattern, and since we are trying to map this 2D communication graph onto the 3D torus of Blue Gene/P, we need to somehow fold from 2D to 3D. There are two ways we could fold; most of the time, the one which tries to keep neighbors on the same side of the torus works better than the other technique, and again, at runtime the best heuristic gets chosen and the folding for that one is done. So we did this for WRF. This is how the results look for this application: on the X axis is the number of cores we ran WRF on, and on the Y axis is the average number of hops per byte per processor. Average hops per byte, similar to the hop-bytes metric, tells us how many links, or hops, each message travels. You can see that for the default mapping — whatever default mapping is done by the Blue Gene job scheduler on the Blue Gene/P machine — every message travels roughly two links. We were able to reduce this to less than one and a half, and as we scale up we actually get much bigger improvements: at two thousand processors every message was traveling three hops, and it now travels close to one hop. So we have brought the communication down to being nearly local in the actual physical topology, not just in the virtual application topology. It remains to be seen whether these improvements in hops lead to better performance for the application.

For that, we divide the running time of the application into the communication time and the computation time. For this case the communication time improved by a couple of percent; for this case the communication time improved by forty-five percent — we had achieved a significant reduction in the number of hops. For this case we actually see an increase in the communication time, but we still see good improvements for both of these cases: an improvement of seventeen percent in the overall application performance on one thousand cores, and of eight percent on two thousand cores. It is surprising that we get this increase in the communication time and not a decrease, and we still have to figure out why we nevertheless get a performance improvement — it might possibly be because of lower contention in the network. What is important to remember is that this is a very complex situation: each parallel application is composed of several computations, and improving performance is very complex. For example, in this particular case, on one thousand cores we were able to reduce the hops by sixty-four percent and the communication time by forty-five percent, but the overall performance improvement was only seventeen percent. This says that the application was latency tolerant in some respects, and that is why this did not translate into a forty-five percent performance improvement. So when you try to map a given application in a topology-aware fashion, you need to figure out how communication-intensive the application is, whether it is latency tolerant or latency sensitive, and then how much time the application spends in communication ultimately decides how much actual performance gain you can get from this technique.

We are also doing this for irregular graphs — so far I have covered regular graphs. [Question] OK — it is the same application, WRF, and we are running it strong-scaled on the same problem: it is a twelve-kilometer-resolution continental US dataset. These are recent runs we have been doing, and I have just started to contact the application developers, because they have more insight than I do into how the application works.
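As a back-of-the-envelope check on those numbers (my own arithmetic, not something stated in the talk): if a fraction \(f\) of the run time is communication that actually blocks progress, then cutting that communication time by forty-five percent shrinks the total time by roughly

\[ \Delta \;\approx\; 0.45\, f , \]

so the observed seventeen percent overall gain is consistent with an effective, non-overlapped communication fraction of about \(f \approx 0.17 / 0.45 \approx 0.38\); the rest of the communication is presumably hidden behind computation, which is what "latency tolerant in some respects" means here.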
As for the number of cores, I am not so worried, because when you have a small partition — for example, five hundred twelve cores is only one hundred twenty-eight nodes on the Blue Gene machine, which is a torus of eight by four by four — the effective diameter is just four plus two plus two, which is eight. So the maximum number of links a message can travel is only eight. The technique of topology-aware mapping, as I mentioned in the beginning, pays off when you have a very large machine and messages would otherwise travel very far — say fifteen or twenty hops — and you can cut that down; then you do much, much better. Typically, for most applications like NAMD and others, the performance improvements start to show up when you have a larger partition. But I could be very wrong about why we see these particular numbers — I would expect the improvement on two thousand cores to be better than what I get on one thousand cores, and right now I do not have a good answer for why this is happening.

OK, so I have also been working with irregular graphs, but I will not go into the details here. The problem is more challenging because you do not have a structure to the communication pattern and you still want to map it onto the 3D torus; the number of neighbors can be arbitrary — a given process might be communicating with ten neighbors or fifteen neighbors — and it is a harder problem than the regular one.

To summarize the research I have been doing on mapping: we have shown that contention for the same links — when the same link is used by several messages — reduces the available bandwidth; the farther messages travel, the more contention there is, and this is a situation we want to avoid. Topology-aware mapping is a technique which is important for a certain class of applications. Sometimes application developers do not realize this; sometimes it is because machine vendors advertise that mapping is not important because they have good bandwidth — but we have shown that we can get good improvements even on machines with fast interconnects. Finally, we are trying to automate this process so that the application developer does not have to worry about it. For some applications, like OpenAtom, the improvements are as high as fifty percent, and we expect this to be true for at least a certain class of applications which are communication intensive — ones that do transposes, matrix operations, and so on. Do I still have five minutes? OK.

So I will just discuss some future work and my plans for maybe the next few years, starting with some extensions to my thesis work. The current assumption is that the entire communication graph can be collected on a single processor. This is going to become difficult as we run on very large machines, both in terms of memory — you need to store the communication graph — and in terms of the time it takes for the communication graph to be collected on one processor. So we want to do something we refer to as hierarchical mapping: a hybrid between centralized and completely distributed. If you do a completely distributed mapping, where every task decides where it wants to go, it does not have enough information and the resulting mapping solution is not optimal. So we want to do a hybrid,
where we form groups of processors, the mapping is done within those groups, and there is a top-level mapping across the groups which does minimal movement and small refinement changes. This can overcome the memory bottleneck and it can also reduce the time it takes to do the actual mapping. Finally, we would also like to extend this to other interconnect topologies that might show up in the future; even within a given node there might be a topology to how the different cores are connected and which link goes out to the network, and all of this needs to be considered when we are doing the mapping.

Looking further ahead, I want to focus on research in the next few years that has real impact on scientific applications, and the two broad directions I want to focus on are communication optimization algorithms and techniques, and load balancing. A good example: the ExaScale Software Study done by DARPA says that as we go to very large machines we will need the runtime system to do static task placement and runtime migration of tasks. Both of these are things we have been doing for quite some time now in Charm++ — you can actually do runtime migration of tasks — and we would want to map the communication topology onto the underlying network topology in a good way; in either case the runtime system is going to be important. The study also says that topological hints exist in MPI but are rarely, if ever, used, and that most languages only express locality of computation — they do not have information about which tasks access which portions of which arrays, and so on. So I am planning to work with MPI developers towards better implementations of the MPI Cart functions, so that you can use the topology information available underneath and do better mappings; the algorithms we develop can be directly translated, so it is more a matter of deploying the same techniques in other programming paradigms. We also want to gather support for topology-aware job schedulers, because on machines like the Cray, if you do not have a topology-aware job scheduler, then however good your mapping is, if you are not allocated a contiguous piece of the machine you will not get the performance improvements — different jobs will still interfere with each other across partitions. This would also require extending the work to other topologies. Communication in general is going to become a big problem: from the current trends, we are not increasing link bandwidth on the networks as fast as we are increasing the speed of the processors, so you are generating more floating-point operations than the bytes you can deliver on the network, and that is going to become a problem. Certainly better implementations of MPI functions, and in particular implementations of MPI collectives, are some other things I want to work on.

The other direction is load balancing. As we move towards exascale, we are going to have radically new scientific applications, such as multi-physics applications where different phases do different things, and you want to load balance all of these phases according to their communication patterns and computational loads.
There is heterogeneity in processing elements already, but you might also have disparities within the node — a single link might be going out to the network and all of the cores on the node might have to go through it — so computation is going to be cheap but communication is going to be expensive, and load balancers need to be aware of all of these things. We need to be aware of heterogeneous processing cores, we need to be aware of temporal changes in load for applications with different phases, and load balancers should also be aware of the communication graph of the application. So I want to work in the area of developing tools and techniques for doing this for MPI and other paradigms which might become important for scientific applications, and this would mean we need runtime instrumentation and runtime task migration for some of these. I think that brings me to the end of my talk. I can take any questions. Thank you.

[Question] Yes — some of it is relevant, in the sense that it would not be the same issues, because you do not have the problem of hops, but you still have problems like which core is accessing which part of the memory, and you want to map things depending on that. So depending on the access patterns to memory, you might want to map things according to that. The issues are different, but the techniques might be applicable in some of these cases. Yes — some of these things would be applicable, but I do not think the technique would work as-is for a different kind of network; depending on the issues, we might have to develop different metrics for evaluating these things. That becomes more important when you have an InfiniBand network, where you have a certain number of ports through which you have to go into the network; for the torus networks I do not think that is an issue.

[Question] We have not, but we are looking at that, because for irregular communication patterns we want to do something like ParMETIS, so that we can do an initial graph partitioning. That minimizes the inter-partition communication volume — it reduces the number of bytes which go over the network — but it does not minimize the number of hops the messages travel. The problems are related, but they are still separate. So that is like a preconditioner for my algorithm, in the sense that I can use it to partition my graph and then still use the topology-aware techniques to minimize the number of hops between communicating partitions — a technique several of the older papers also used. No, we have not done a comparison with such schemes yet. Thank you, thank you.