Well, Richard, thank you very much. It's a pleasure to be here. We've actually been trying to make this happen for quite a while, and after a year or so it finally happened: I really am physically here, and it's a pleasure. It's also really nice to come a little bit farther south while it is winter in Pittsburgh, so well done, and thank you very much for the invitation. I'm here to talk about Spiral, a program generator for linear transforms and beyond. Basically I'm going to give you a talk about the software side of the Spiral project, which has been going on for many years now across Carnegie Mellon, UIUC, and Drexel University. Of course there are many, many contributors to the project, but what I'm going to talk about today is mainly joint work with Frédéric de Mesmay and Yevgen Voronenko, who were PhD students of my colleague Markus Püschel, who is here and who is one of the main PIs. The work has been supported by DARPA, the Office of Naval Research, and companies like Intel and Mercury over the years. So these are the people who have been working on it, and we have all grown older since then.

So what is the problem we are working on? The problem is this: if you looked at the space of architectures some time ago, say in the year 2000, life was good, at least for commodity, commercial-off-the-shelf processors. You had single-core CPUs; yes, there were caches and there was a memory hierarchy, but that was kind of OK. Since then a lot of things have happened. Multicore got introduced, FPGAs became more and more mainstream and got into computing, graphics processors became GPGPUs, all kinds of crazy things, and basically the architects don't really know where things are going, so they have to try things out. Because they have to try all these different things, there is not really a standard architecture; it is not yet clear what the next generation is going to be, except that it is going to be parallel. And that brings the problems of programmability, performance portability, and rapid prototyping: it is really hard to get a fast program running on any one of these architectures, let alone portable performance across architectures.

You might expect that, but it starts looking really bad even if you pick the most benign of these processors and look at what happens there. Let's pick something like an Intel Core i7, a nice quad-core at around three gigahertz, and let's run a numerical kernel that is well understood and that people have worked on for a long, long time: the fast Fourier transform. Look at the performance plot: on the x-axis we have the problem size as powers of two, on the y-axis we have performance in Gflop/s, and higher is better. You see the performance of two different libraries: one is Numerical Recipes and the other is the best code available. Numerical Recipes is a library you download from the web; it's about one page of code, a C implementation of an FFT, and if you gave a good grad student the task of implementing an FFT, they would not write better code than Numerical Recipes. Nevertheless, between Numerical Recipes and the best code there is between a 12x and a 35x performance difference. Now one could say, well, that's just a different operation count, but it turns out that these two implementations have essentially the same operation count.
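For reference: plots like this are conventionally normalized with a nominal FFT operation count rather than the operations each code actually executes (an added note; the exact normalization used on the slide is not stated in the talk):

```latex
\#\mathrm{flops}(\mathrm{FFT}_n) \;\approx\; 5\, n \log_2 n,
\qquad
\mathrm{performance\ [Gflop/s]} \;=\; \frac{5\, n \log_2 n}{\mathrm{runtime\ [ns]}}
```

Since both implementations perform essentially this many operations, the 12x to 35x gap reflects how the work is executed, not how much work there is.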
So it really is a lot of other things, and I'm going to go into what the problem is in a second. And the problem is not special to the FFT: matrix-matrix multiplication gives a factor of up to one hundred and sixty, for example, when you run the standard implementation against the best one out there. So there is something else going on, and it's true for all numerical codes. Here we see what is going on: roughly a factor of five in performance is lost through the memory hierarchy, basically through mismanagement of the caches; that's the 5x. Then there are vector instruction sets, and not using them gives away another 3x. So we lose about 15x of the 30x to 35x to single-core optimizations, and then there is multicore parallelization, which is another factor of about three. Because of all that, performance library development is a total nightmare, and this is the simplest, nicest one out of this whole big bang of architectures. Now what are you going to do to get good performance across all these platforms, if the simplest one is already like that? What Spiral is trying to do is address this question and make it possible to get high performance across these platforms.

Now, the state of the art, apart from the program generation efforts, is the current vicious cycle: whenever a new platform comes out, programmers have to go there and start fine-tuning the matrix multiplication, the FFT, then ScaLAPACK and so forth, all the basic kernels. Intel basically has an army of programmers in Russia, IBM maybe in Russia or China, who knows where, and they all play catch-up all the time, because you can never reach full performance before the next platform arrives. In 2005 there was a special issue of the Proceedings of the IEEE that showed the state of the art in automatic performance tuning. Back then, Sparsity was in there, ATLAS was in there, FFTW was in there, and also Spiral and other compiler techniques. That issue described the state of the art in automatic performance tuning, where the computer takes part of the problem and automates it, but it was mostly single-threaded, non-parallel. The question now is how you move the field from single-threaded performance tuning to multithreaded performance tuning, and we have taken a couple of steps in that direction in Spiral; the remainder of the talk is basically about that topic.

The organization is as follows: I will give you a little bit of an overview of the Spiral framework, then we go into parallelism, then I will talk a little bit about general-size libraries and show you some results that it actually works, and then we conclude. So what is Spiral? In traditional code generation, or program development, you have a library specification, you have a synthetic aperture radar book, you have an FFT book, you have journal papers and algorithms, and you have a number of very smart people who know the architecture. They start working very, very hard, and after a while they get a high-performance library, optimized for the platform they know, and whenever the platform changes they may have to go back to the library and redo it. It is real hard work, and there are very few people who can build a really, really fast performance library. With Spiral the picture is basically the same, except that the people are replaced by this red box, which is Spiral. It's basically their clone.
So basically the point here is that these people did not lose their jobs; they actually became happy, because they could work on more interesting problems than shuffling assembly instructions around into the right order. And that actually happened with industry: for a while people were afraid for their jobs, and then they finally realized it's really good for them, because they are playing catch-up all the time anyway, and now they can play catch-up on interesting problems, not on the hand-tuning of kernels. So basically Spiral replaces the human effort here, and the important thing is that the performance is comparable to hand-written libraries, because otherwise it wouldn't be any good. That is what the Spiral system we've developed over the years is doing. In a nutshell, it's a library generation system for performance-critical kernels. It initially focused on linear transforms, but recently we started branching out, and we now support many other kernels, including some linear algebra, some communication, and some image processing kernels. It supports a wide range of parallel platforms: everything started out with vectorization for SIMD vector instruction sets, then we got into threading, message passing, streaming, gate-level parallelism on FPGAs, and offloading, basically offloading to a GPU or to an FPGA.

Our research goal over all these years has been to teach the computer to write fast libraries. So instead of a human doing the work, let the computer do the work. The idea is that whenever a new platform comes out, we just want to regenerate the library; we don't want anybody to have to do any work. But every now and then something bad happens and the vendor gets a new idea, like: why don't we take a graphics card and glue it onto a CPU? That is a game changer, and when things like that happen, we as the providers of the Spiral tool have to go back and update the tool, and hopefully minor changes in the tool, not major changes, will allow generation for the new architecture. We've shown over the years that this actually works. What Spiral also can do is generate commercial-grade software, which means that over the years Intel started to use Spiral to generate parts of the Math Kernel Library and the Integrated Performance Primitives, which are highly optimized libraries that Intel provides and has people in Russia optimizing. And last year a commercial entity called SpiralGen was spun out of the university, and we are currently seeing where that will go.

So the vision behind Spiral is this: when you have a numerical problem, usually there is a C program that implements the functionality. The C program tends to be an over-specification, and because of that the compiler runs into deep problems: the compiler has to try to extract back all the information that was there, that the human knew, but that was thrown away by the program. Going from the numerical problem to the C program is a human effort; people have to go to the library, have to understand everything, and once they have written the program they press the button, the compiler takes over, and they get the executable; it's just that they don't get the performance they would like. What we're trying to do is take away the C program and say: up here you write a specification of what you want to do, and then everything is automated from the top to the bottom. That's the philosophy: basically get rid of the C program, get rid of the human in the loop. Now, what's the main idea to make it work?
The main idea is that we have an architecture space, we have an algorithm space, and we have a common abstraction that describes both architectures and algorithms, and the transformations between them. The important thing is that everything has to be in the same language, so that it can be manipulated symbolically, and what we learned is that something based on multilinear algebra, with a bit of tensor notation and a bit of other crazy things mixed in, works, so that you can describe the problem and its structure. That way we can basically parametrize the architecture space and then derive the optimized programs automatically. In some other work, which I'm not going to talk about too much, we can also flip it around and say: given the algorithm, find the architecture. There is a thread in Spiral where we actually build custom hardware architectures for problem classes; that is basically co-design.

Now, how does Spiral work? You have a problem specification as input, so you say something like: I want an FFT of size 1024, or I just want an FFT library like FFTW, or I want an FFT library like IBM's ESSL. That is the specification, and then you state the target machine. Spiral goes off, does its thing, and after a while out comes a fast executable program that can be compiled, that actually implements the function, with the right interface, on the target machine, very fast. So it basically replaces the whole programming effort from specification to executable. The basic idea is that we use a declarative representation of algorithms inside the system, and we use methods like rewriting systems to transform algorithms and map them to hardware.

So what does the face of Spiral look like? If you go to the web page, it looks something like this: for every target you go there, you say I want this DFT, say the one that is used in WiMAX, you set all the parameters, you click the button to generate code, with vectorization for example, and after two, three, four minutes you get a thousand or two thousand lines of code, and this is highly optimized code that beats the best code out there. In order to get funded, basically, our DARPA program manager had to be able to do that himself. And it worked: he used it, clicked, and he generated the fastest code out there, so we were happy. That is Spiral in a nutshell.

Now let's look a little bit into the mathematical underpinning of the system and why it works. The origin of Spiral is FFTs and matrix-vector multiplication. Basically, an FFT is a way to get the frequency content of a signal, and it can be expressed as a matrix-vector product: the product of a vector of data with a matrix that you know at compile time. So it's a matrix-vector product, and moreover that matrix has a certain structure: you can factor the matrix into a product of sparse matrices, like down here, and that brings the order n-squared operations down to order n log n operations, which gives a fast algorithm for this matrix-vector product, or for the signal transform. That's the original idea. And moreover, the structure of these sparse matrices is such that you can write them using a tensor product, or Kronecker product, notation, and you can describe basically all known FFT algorithms with a handful or two of symbols.
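To make the structure concrete, here is the standard way it is written down in the Kronecker-product notation that Spiral builds on (an added illustration; T is a diagonal matrix of twiddle factors and L a stride permutation):

```latex
y = \mathrm{DFT}_n\, x,
\qquad
\mathrm{DFT}_n = \big[\, \omega_n^{k\ell} \,\big]_{0 \le k,\ell < n},
\quad \omega_n = e^{-2\pi i / n}

\mathrm{DFT}_{km} \;=\; \big(\mathrm{DFT}_k \otimes I_m\big)\; T^{km}_m\; \big(I_k \otimes \mathrm{DFT}_m\big)\; L^{km}_k
```

The second line is the Cooley-Tukey FFT written as one such sparse factorization: two stages of smaller DFTs glued together by a diagonal scaling and a permutation.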
That was the first push towards codification of the knowledge of algorithms, and it is based on the work of Charles Van Loan in his book; I don't know the full title off the top of my head, but it is Van Loan's framework for FFTs. We then pushed it further, and one of the founders of Spiral, Jeremy Johnson, worked a lot on that representation. The idea is that you use just a few symbols to describe a huge space of algorithms, and moreover, because you can describe it all mathematically, it becomes tractable for automation. Where we are right now: we have about fifty different transforms, like the FFT, real transforms, sine and cosine transforms and so forth, and about two hundred, maybe two hundred and fifty, breakdown rules, which, for those of you who know FFTs, are things like Cooley-Tukey or Rader, for example. These rules teach Spiral what it means to compute an FFT; it is a codification of knowledge, and it codifies on the order of two hundred journal papers' worth of knowledge in the field of signal processing. Once you have the knowledge, you have a formal grasp on the problem.

The next question is: that was FFTs, and FFTs are nice, but how do you go beyond FFTs? We were stuck with FFTs for many years, and then you want to go beyond. The first observation is that a transform is a linear operator with exactly one input vector and one output vector, and because of that fact we could use multilinear algebra and the tensor product to represent the operations. If you want to go beyond that, you have to break that representation, and the question becomes how you break it without breaking too much. The idea is: first, let's drop linearity and say we just have an operator that is potentially nonlinear, and moreover it may have more than one input or more than one output. The previous language that wrote the tensor products down we called the signal processing language, SPL, and we had to extend that language into the operator language, OL, which is internally our representation of that idea. After we had done that, we had to generalize all the rewriting and all the infrastructure to the more general operator framework. Just to give you a little bit of an idea of what that meant: we had to define, in really strict mathematical terms, what it means to be an operator, what the operations and the higher-order operations are, and so forth. Moreover, we had to break the tensor product definition, because it is only defined for multilinear operators, and we defined something that looks like a tensor product and feels like a tensor product but is actually defined differently, and by doing so we could lift the framework from the space of linear operators to nonlinear ones, and we kept everything that we wanted for the representation of algorithms. It just doesn't make much sense mathematically anymore, but that's OK, because this is program generation, not mathematical research.

So that is the operator language behind it, and once you have that, you can start trying to describe different fields using the new language and see how much it is able to describe. Here we have the field of linear transforms, with the two hundred twenty rules. And now consider, for example, matrix-matrix multiplication.
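As a rough illustration of that step from SPL to OL (an added aside; the precise OL notation in the Spiral papers differs in its details):

```latex
\mathrm{SPL:}\quad y = M x, \qquad M \in \mathbb{C}^{n \times n}

\mathrm{OL:}\quad \mathrm{MMM}_{m,k,n} : \mathbb{R}^{m \times k} \times \mathbb{R}^{k \times n} \to \mathbb{R}^{m \times n},
\qquad (A, B) \mapsto A B
```

In SPL everything is a linear map applied to one input vector; in OL an object like matrix-matrix multiplication is simply an operator with two inputs and one output, and it is not required to be linear.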
It turns out that if you want to codify the idea of what it means to do matrix-matrix multiplication, you only need the following ingredients: a one-by-one matrix-matrix multiplication is a scalar multiplication, and you can cut the matrices horizontally, you can cut them vertically, or you can cut them along the common dimension. Once you have written that down, which we have basically done here, and it looks like gibberish, but it is the mathematical notation we are using, then the computer understands what it means to do blocked matrix-matrix multiplication. The research question becomes: can you take that input and generate ATLAS from it? The answer is almost yes: we have something almost like ATLAS, not the full BLAS, not everything, but in principle we can compile this down to an adaptive library of about ten thousand lines of code. The same thing here with synthetic aperture radar: we had to talk to radar experts for a year to find out what it means to do SAR, and after that was done we could codify that knowledge into a couple more rules like these, and now Spiral knows what it means to do SAR and can generate radar implementations from a high-level specification. The same thing happened with the Viterbi decoder, a decoder for convolutional codes: we just had to codify it, Spiral understands the specification, and it generates very fast decoders.

So what we see from that is that many, many kernels can be written in the operator language, but it is not clear exactly what can be done and what cannot. Moreover, these are mathematical objects; it's all very fluid, you can bend it and stretch it in all kinds of directions, and the research question is of course how far you can push it and where it breaks down, because in the limit it would become a general-purpose language, and all the power that we have would be lost, since all the knowledge about the domain would be lost. So we are doing a balancing act: make it general enough to get enough domains in, but don't make it so general that it breaks down.

Now, once you have the operator language formalized, you can have a special-purpose compiler that takes the operators down to code. Here we show it going down to sequential C code, but in reality it of course also takes it down to pthreads or to MPI, so there is a very general special-purpose compiler that can take these expressions down to all kinds of target platforms. We also have a student working on further targets right now, and we have a couple of preliminary results, but it's ongoing. We had looked into one of those directions a couple of years back, but the student moved on at some point and we stopped, because the payoff on general-purpose architectures didn't look good enough at the time.

So the basic idea is this: take the example here, which is a tensor product of an identity matrix with something else. For those of you who see what that matrix means, it becomes a block-diagonal matrix, so that thing becomes a loop of the same kernel over different data sets that are contiguous in memory. That basic observation, that a formula has a meaning as a program, drives the whole special-purpose compiler, and that is one of the key things that lets us take a formula representation all the way down to high-performance code.
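A minimal sketch of that observation, with a made-up two-by-two kernel; this is an added illustration, not Spiral-generated code:

```c
#include <stdio.h>

/* A fixed 2x2 kernel A (example values). */
static void kernel_A(const double *x, double *y)
{
    y[0] = 1.0 * x[0] + 2.0 * x[1];
    y[1] = 3.0 * x[0] + 4.0 * x[1];
}

/* y = (I_4 (x) A) x : the tensor product with an identity is a
 * block-diagonal matrix, i.e., a loop of the same kernel over
 * contiguous blocks of the data. */
static void i4_tensor_A(const double *x, double *y)
{
    for (int i = 0; i < 4; i++)
        kernel_A(x + 2 * i, y + 2 * i);
}

int main(void)
{
    double x[8] = {1, 2, 3, 4, 5, 6, 7, 8}, y[8];
    i4_tensor_A(x, y);
    for (int i = 0; i < 8; i++) printf("%g ", y[i]);
    printf("\n");
    return 0;
}
```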
Now, the whole flow is: you start out with the functionality, you go through the operator language and a couple more internal representations, until you finally end up with threading, vector intrinsics and so forth, and then you use a standard compiler. That is the Spiral box. And the steps here are things a standard compiler cannot do, because at that level the problem is way too hard: a standard compiler does not understand what it means to do an FFT, or a transposition, or whatever, so we do it all at a high level, symbolically, and that is where the platform knowledge goes in. Basically what we are doing is optimization at a high level of abstraction, and so the compiler problems that usually show up go away, and that allows full automation for realistic problems.

Yes, I can give a little bit of intuition; it is probably a talk by itself. Basically there is a data-flow-like structural language, and then there is a loop-based language; going from one to the other you make loop structures explicit, which means you cannot rearrange the data flow anymore, but you can merge permutations into the loops. In the pure data-flow representation you cannot deal with readdressing properly, which is a very important part of the optimization. The parallelization happens at the data-flow level, where you understand the platform and map things properly to the architecture; I will get there in a second. The loop level then does the cleaning up. The different dimensions, vectorization, threading and so on, combine essentially as a cross product; the system is made to be as orthogonal as possible. Then comes code with intrinsics, and that's it. Do you need a compiler or an operating system? Since this is performance computing on dedicated architectures, people don't care much about the operating system: you basically pin down the threads on the different cores, and that's it.

OK, so the question you just raised brings us right to the next topic: parallelism. There are many, many different types of parallelism: multicore and multithreading, shared memory, SIMD vectors, streaming, offloading, graphics processors, gate-level parallelism or partitioning on FPGAs. Typically, in your tool flow you have a compiler very far down which tries to extract the parallelism and does as good a job as it can, and the compiler people have done a tremendous job, but it is a very hard problem that does not look like it will be completely solved. So what we are doing is move that up to the algorithm optimization level, where we still have the full information about the algorithm, and one methodology covers all of these forms of parallelism; they are just different instantiations. How does it work? Take shared memory, for example. We are in SPL, which is the one-input, one-output version of the operator language, and we have a construct that we've seen before, a tensor product of an identity with a kernel, which means a loop over contiguous data. Now, if you see I_p tensor A, and you know that p is the number of processors, then it is clear that each processor can run A on its own block, it is going to be load balanced unless A is crazy, and with p equal to the number of processors you utilize all the resources; it is going to be nice. Here is the picture for that: the red, yellow, green, and blue blocks are the different processors, each working on its own piece of data. So that's nice.
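Continuing the earlier sketch, here is a minimal illustration of how that construct maps to shared memory, using OpenMP for brevity; generated code may instead use explicitly pinned pthreads, and the kernel is again a made-up example:

```c
#include <stdio.h>

#define P 4                                    /* number of processors (assumed) */

static void kernel_A(const double *x, double *y)   /* toy 2x2 kernel */
{
    y[0] = x[0] + x[1];
    y[1] = x[0] - x[1];
}

int main(void)
{
    double x[2 * P], y[2 * P];
    for (int i = 0; i < 2 * P; i++) x[i] = i;

    /* y = (I_P (x) A) x : one contiguous block per processor,
     * embarrassingly parallel and load balanced.               */
    #pragma omp parallel for
    for (int i = 0; i < P; i++)
        kernel_A(x + 2 * i, y + 2 * i);

    for (int i = 0; i < 2 * P; i++) printf("%g ", y[i]);
    printf("\n");
    return 0;
}
```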
Now there is a counterexample to that, which would be something like a matrix transposition, in our lingo a stride permutation: you take a four-by-four matrix, stored linearized in memory, and transpose it. If you have that, and you know the cache line length, what you see is that you get false sharing: the red and the blue processors step on each other's toes, a lot of cache coherence protocol events happen, and the thing runs dramatically slower than before. Something like that is really bad. But moreover, we see that we can just look at a formula, without ever writing a program, and know that this one is going to be good and that one is going to be bad. And what holds true for cache lines also holds true for MPI messages and so forth; it is really only a question of how you interpret the formula, of giving meaning to the formula. The idea now is to get to a formula that consists only of the good parts and has none of the bad ones, or comes as close as possible to that ideal.

Now, how do you take a formula and translate it? By rewriting rules. Every rewriting rule basically encodes the knowledge of a compiler transformation: it gives a syntactic pattern, and it provides the knowledge that the transformation is legal, in one line. For example: whenever you want to do a matrix transposition, you can block it; the transposition can be blocked, and you can block it in such a way that the coarse-grain blocks run on different processors and the fine-grain blocks are compatible with your cache line size. That is the idea an expert would explain to you: I would do it like that, and that makes a lot of sense. We have a way to take that idea and write it down formally, mathematically, like this, and so forth. We have many, many rules like that; this is Spiral's transformation rule base. The important thing here are the tags: they say I am on a shared-memory machine with p processors and cache line length mu, so the machine parameters go into the formula; you see that the mu shows up here and the p shows up there. What that means is that well-known linear algebra identities get a new meaning: a factorization of a permutation now means blocking of a matrix transposition. That way we can codify the knowledge of how to parallelize.
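Two schematic examples of such tagged rules, for a shared-memory machine with p processors and cache line length mu (an added illustration; the notation is simplified and the actual rule set is richer):

```latex
I_{pm} \otimes A_n \;\longrightarrow\; I_p \otimes_{\parallel} \big( I_m \otimes A_n \big)

A_m \otimes I_{n\mu} \;\longrightarrow\; \big( A_m \otimes I_n \big) \otimes I_{\mu}
```

The first rule splits a loop into p load-balanced chunks, one per processor; the second groups the data so that it moves only in whole cache lines of length mu, which avoids false sharing. Both are ordinary tensor-product identities; the tags are what give them this architectural meaning.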
So we have the two hundred twenty rules for the transforms, we have these architecture rules, and we can throw them all into one big bucket, have a rewriting system work on it, and say: please find me a good solution. The way that works is: you start out and say we want an FFT of a certain size, the rule system starts working, uses the FFT rules and the architecture rules we have just seen, and after it is done it comes up with this formula. The formula is not very readable, but it is only an internal representation of the Spiral system. You can now start color-coding it: this part here has a tensor with an identity of the cache line length on the right side, so it is just shipping around cache lines of the right length, which is good; there is no false sharing here and here. And the blue part has an I_p tensor in it, so it is embarrassingly parallel, nicely load balanced, and good. So by giving meaning to the formula, you can formally derive an algorithm that is good in a very simple machine model. Now you have many of those algorithms that are all provably good in that simple model, and on top of that you can start doing empirical tuning and search the space of different implementations; there might be forty or fifty of them. The same idea applies to matrix-matrix multiplication: you start out saying I want a matrix-matrix multiplication, I have this many processors and this cache line length, the system goes off and ends up with this formula, which now is a parallelized matrix-matrix multiplication, and again we have red and blue: red is data exchange with no false sharing, and blue is embarrassingly parallel computation of smaller matrices. That formula then just has to be translated by the backend compiler into OpenMP or pthreads or whatever else; that is standard engineering to make it nice and make it work.

So that was shared memory. We do the same thing for message passing with MPI, for the Cell, as you may have guessed, and for vectorization, with SSE for example; all these architectures fall into this framework. For GPUs, here is a formula for the 7900 series, where we had some pretty cool results using OpenCL and OpenGL and Cg and what not. And here is Verilog for FPGAs. It all works, it is rigorous, it is correct by construction as long as the rule system is implemented properly, and it overcomes the compiler limitations. It also allows one more interesting thing: we can do pre-silicon optimization. We get a document that describes a new instruction set, without ever having seen the architecture or a compiler; we can write our own little emulation library that just gives meaning to the instructions, then tell Spiral to use these instructions according to what we have just seen, and then say: please minimize the operation count, or minimize a weighted operation count according to some model. Spiral then produces code like this; that is a 64-point FFT for the upcoming vector instruction set, and the performance line here is blanked out because we cannot show any numbers, but we have the code ready. For that next-generation Intel processor it took us basically one transatlantic flight to build the emulator, then a couple of days of debugging, and then we had very fast code very early.
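A sketch of what such an instruction emulation library can look like: plain C functions that give meaning to instructions that do not exist yet, so that generated code can be run and validated. The instruction names and semantics below are invented for illustration and are not a real ISA:

```c
#include <stdio.h>

/* Mock 4-way vector register and two "instructions" emulated in plain C. */
typedef struct { float f[4]; } vec4;

static vec4 vadd_ps4(vec4 a, vec4 b)              /* hypothetical 4-way add */
{
    vec4 r;
    for (int i = 0; i < 4; i++) r.f[i] = a.f[i] + b.f[i];
    return r;
}

static vec4 vshuf_ps4(vec4 a, vec4 b, int imm)    /* hypothetical shuffle */
{
    vec4 r;
    r.f[0] = a.f[ imm       & 3];
    r.f[1] = a.f[(imm >> 2) & 3];
    r.f[2] = b.f[(imm >> 4) & 3];
    r.f[3] = b.f[(imm >> 6) & 3];
    return r;
}

int main(void)
{
    vec4 a = {{1, 2, 3, 4}}, b = {{10, 20, 30, 40}};
    vec4 s = vadd_ps4(a, b);
    vec4 p = vshuf_ps4(a, b, 0xE4);   /* 0xE4 selects a[0], a[1], b[2], b[3] */
    printf("%g %g %g %g | %g %g %g %g\n",
           s.f[0], s.f[1], s.f[2], s.f[3], p.f[0], p.f[1], p.f[2], p.f[3]);
    return 0;
}
```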
Now to the last part: general-size libraries. The problem statement is the following. If you want a program that runs not for a single size, like an FFT of size 1024, but for all sizes, that is a very different problem. In some domains, for example if you build the physical layer of a transmitter in a software-defined radio, you only have to support a handful of sizes, say 1024, 768, and 64, and you want the smallest possible code for exactly those sizes; the original Spiral was targeting that kind of situation, and we made the problem size a compile-time parameter. But when you enter the scientific computing space, that assumption doesn't hold anymore, because people there want to link against a library and then run with the same code regardless of what the sizes turn out to be. It turns out this is an extremely complicated problem, and my colleague Yevgen's PhD thesis was basically how to solve it.

Here I'm going to give you the two-or-three-slide short version of what he did in his thesis, in collaboration with the whole group, obviously. The input is: I want to do a DFT, a discrete Fourier transform, of unknown size; I want to use this FFT, which is written like that; I want vectorization; I want threading; and to the outside world I want it to look like FFTW. After the system has worked for a while, the output is an optimized library, for example ten thousand lines of C++; it could be fifty thousand lines of C, or Java, or C#, it doesn't really matter which language. It is general, so the problem size is not known at compile time; it is vectorized and multithreaded; and it has a runtime adaptation mechanism, like a search, that tries things. And the performance is competitive with hand-written libraries, Intel MKL for example. That's the picture: you take the formal specification, put it into the library generator, and out comes a high-performance library.

Once you can do that for an FFT library, it opens up quite a big space, because you can for example say: take the Cooley-Tukey FFT and make it look like FFTW, so you generate something like FFTW, or a subset of it, from a formal specification. You can also say: make it look like the Intel MKL interface, or I really want it to look like FFTPACK, because that is just the interface I like. And you can think of combinations like an FFTW-style library on the Cell, or an MKL-style library on a GPU or on a Blue Gene, things that nobody would ever build by hand. Now you can also use algorithms that Markus, my collaborator, developed in his other research thread, which is on signal processing: he developed new transform algorithms, so you can build a discrete cosine transform library that looks just like the obvious cosine transform library but is a factor of two to five faster. We can do the same thing for things like FIR filters, or for matrix-matrix multiplication; we are not at the full ATLAS, obviously, but we can do something that looks and feels a little bit like ATLAS or MKL, and so forth. Yevgen's thesis covers the FFT and FFTW-like signal processing libraries, and then Frédéric de Mesmay, the other student, extended this to matrix-matrix multiplication and other kernels.

The idea is that you still have the breakdown-rule database we had before, and you have the platform knowledge we have seen before. Now you do something called recursion step closure, which finds the recursive algorithm structure. Once you know the recursion structure, you have base cases and recursion steps, which give you recursive functions, codelets in FFTW lingo, and then you use the old fixed-size Spiral to generate actual code for the recursive functions, which are mutually recursive, and for the base cases, which are just straight-line code. Then you do something called hot/cold partitioning, which is an analysis of which parameters are compile-time, which are runtime, and which are initialization-time. Then you tie everything together, it gives you the interface, and off it goes. If you go to the Spiral web page, there are a couple of libraries like that available for download, and you can see what these ten thousand lines look like.
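A sketch of the hot/cold split such a generated library ends up with: size-dependent setup happens once at initialization time, and the hot call takes only the data. The names and the trivial body are invented for illustration; this is not the actual interface of a Spiral-generated library:

```c
#include <stdio.h>
#include <stdlib.h>
#include <complex.h>

typedef struct {
    int n;                      /* problem size: fixed at initialization (cold) */
    double complex *twiddles;   /* precomputed constants (cold)                 */
} dft_plan;

/* Cold: choose a recursion strategy (possibly by search), precompute tables. */
dft_plan *dft_plan_create(int n)
{
    dft_plan *p = malloc(sizeof *p);
    p->n = n;
    p->twiddles = calloc((size_t)n, sizeof *p->twiddles);
    /* ... fill twiddles, pick among the generated recursion steps ... */
    return p;
}

/* Hot: only runtime parameters (the data pointers) are passed per call. */
void dft_execute(const dft_plan *p, const double complex *in, double complex *out)
{
    for (int i = 0; i < p->n; i++)
        out[i] = in[i];         /* placeholder for the actual recursive computation */
}

int main(void)
{
    dft_plan *p = dft_plan_create(8);
    double complex x[8] = {1, 2, 3, 4, 5, 6, 7, 8}, y[8];
    dft_execute(p, x, y);
    printf("%g\n", creal(y[3]));
    free(p->twiddles);
    free(p);
    return 0;
}
```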
Now let's go to the next slide and see what that means. You start out with the DFT and apply the Cooley-Tukey FFT algorithm, and you notice the curly braces here: the curly braces mark what becomes a recursive function in the framework. You see that the curly braces get propagated around, and what the system figures out automatically is, if you know FFTW, that in order to implement something like FFTW you cannot just use plain smaller DFTs: you need smaller DFTs with strided input, strided output, and twiddle-factor scaling. That is something Matteo and Steven had to figure out by themselves, and now we have a formal way of figuring it out automatically. The sum here marks the recursive pattern, which means divide and conquer into a couple of FFTs and then another couple of FFTs; that is what it finds. The curly braces are problem specifications for fixed-size Spiral, which basically means: generate code for those. I don't really have time to explain it in more detail, but that is the idea: you start from a formal specification, feed it in, and a computer algebra system discovers all the things that people previously had to figure out by hard thinking.

A codelet is a small piece of code, like an FFT of size sixty-four, that kind of thing; FFTW coined the term because it is not a full FFT code, just a small piece of code. No, no, these are fully resolved, so in the rewriting the recursion terminates. And adding a breakdown rule is a PhD-thesis kind of thing; it is very, very complicated. Yes, the user would be a PhD student on the Spiral team. It is so complicated that it is essentially impossible for somebody who is not permanently immersed in it to understand what is going on, whether you go top-down or bottom-up; you really have to understand both the mathematics and the implementation level in great detail before you can do anything.

This is just a picture: if you do a scalar FFT, you get four recursive functions, and that is a picture of their definitions; basically they are a little bit like codelets, but with recursion, so in FFTW lingo, that is what you see here. The last thing: if you do it for a cosine transform, vectorized, it blows up into many more recursion steps, and these are all the, probably unreadable, problem specifications that now go into Spiral to build the base cases; these are the specifications of the codelets. So this is Spiral's approach: from a formal specification it derives the recursion steps and the whole library infrastructure automatically, something that before took people many, many years of manual work.

We also got access to Blue Gene L and Blue Gene P, and they have special SIMD hardware; we could just teach Spiral to make use of it, and it worked, so back then it was a very short effort of just going there and understanding the architecture. And there is a student who worked on the Cell; he is about to graduate.
What we found there is that we have two plots: one is basically as good as it gets when you stay on-chip and allow for custom data formats, which is what you would do when the FFT is a kernel inside a larger computation and the data is already distributed; there the performance goes up even further. And here we compare against other codes out there in the standard setting, and what you see is that we go up to about twenty gigaflop/s and basically flatten out at around twenty gigaflop/s over that range of sizes; here, for some reason, it starts shooting up, and here it runs out of memory and doesn't work anymore, so the student is right now trying to reproduce that and fix it; it is not yet done. But it shows you: you describe the architecture, and Spiral can handle it. We also have some results on FPGAs, where we are as good as the Xilinx LogiCORE, basically what the vendor did by hand, but automatically.

One important thing: once you have a library generator, you can play a game. You can ask: how much performance do I get for one thousand lines of code? How much for three thousand lines of code? Let's go to ten thousand, let's try thirty thousand, and so forth. So that is yet another knob: you can really trade off performance against code size. No, you just run it: basically you say how much code you can afford, and then you try out what happens. To give you a sense of scale, FFTW is on the order of two hundred thousand lines of code. We also have some results for matrix-matrix multiplication, where we are sometimes fifteen to twenty percent slower than MKL, sometimes on par with the Goto library, and once we know that we are doing a rank-k update, we actually outperform them by quite a bit. So that is the space: these are kernels that are self-adapting and automated, almost like ATLAS, but only for smaller sizes at this point.

The last story I am going to tell: we were challenged with the following. There is an implementation, which got a best paper award in 2007, that runs synthetic aperture radar on a Cell with eight cores and twenty-five gigabytes per second of memory bandwidth. Why don't you get the same performance on four cores with a fraction of that memory bandwidth? So we said OK, let's try it, and it worked. It translates into extremely aggressive optimization: the full code generation system was running for twenty-four hours straight, produced about a megabyte of code, and then was outperforming the experts.

So where are we currently headed? One direction is new applications and algorithms: linear algebra, image processing, software-defined radio; we just recently started working on coding and on radar processing, so heavy-duty processing kernels. On the platform side: whatever is new and hot and parallel basically works for me. And there are other directions that I have left out of this talk. So let me come to the summary. In summary, Spiral is a successful approach to automating the development of performance libraries. It is commercially used and proven, and we are currently working on the commercial entity, SpiralGen, that spins out the technology. What it does is take something like a DFT of size sixty-four and produce code directly with intrinsics. And of course the key ideas are the domain-specific, declarative
representation based on multilinear algebra and tensors, the operator language, which goes well beyond linear transforms if you extend it a little. And the difficult optimizations that are usually done at the compiler level we do at a high level of abstraction, through rewriting, instead of through very expensive analysis. In case you want to learn a little more, there is spiral.net for the academic project, and there is the web page of SpiralGen for the commercial entity. Thank you very much.

Together with a colleague at the university we are working on exactly that topic; there is a PhD student working on it, and we have a couple of first steps; that is how we are going to go forward. The code actually was not available for a long time, and the reason for that was not really the company in the first place; that has changed now, of course. The problem is that the system is really, really complicated and extremely fast-moving, and because of that somebody needs something like a year of training before they can do anything with it; think about what that means for documentation, which is not there. So for your question: if you just try to install it and generate your FFT, it doesn't work that way; you just cannot do that. Because of that we decided to put only the older version on the web, and only later we figured out that we actually needed to go commercial, because many of the things the companies we worked with wanted us to do you simply cannot do in a university, since the students have to graduate and cannot be engineers doing production programming. But no, that is actually too harsh a statement: an older version of Spiral, Spiral 3.1, which can do scalar code and loop code and which has the SPL compiler and part of the rule database, should still be available. I think; maybe it is not on the web anymore; I would really have to double-check; maybe it was taken down recently by one of my co-founders. Thank you very much.

Yes, so basically we talk about that quite a bit, and what I personally think is that you want to have a backend code generator, so we are looking into that. And I think, for example, I would never do Spiral for domains where the language of tensors has no place; there you have to use other tools. That is my personal view on it. Thank you very much.