Well, Richard, thank you very much. It's a pleasure to be here. We've actually been trying to make this happen for quite a while, and after a year or so it finally happened: I really am physically here, and it's a pleasure. It's also really nice to come a little bit farther south while it is winter in Pittsburgh, so well done, and thank you very much for the invitation. I'm here to talk about Spiral, a program generator for linear transforms and beyond. Basically I'm going to give you a talk about the software side of the Spiral project, which has been going on for many years now across Carnegie Mellon, UIUC, and Drexel University. Of course there are many, many contributors to the project, but what I'm going to talk about today is mainly joint work with Frédéric de Mesmay and Yevgen Voronenko, who were PhD students of my colleague Markus Püschel, who is here and who is one of the main PIs. The work has been supported by DARPA, the Office of Naval Research, and companies like Intel and Mercury over the years. So these are the people who have been working on it, and we have all grown older since then.

So what is the problem we are working on? The problem is this: if you looked at the space of architectures some time ago, say in the year 2000, life was good, at least for commodity, commercial-off-the-shelf processors. You had single-core CPUs; yes, there were caches and there was a memory hierarchy, but that was kind of OK. Since then a lot of things have happened. Multicore got introduced, FPGAs became more and more mainstream and got into computing, graphics processors became GPGPUs, all kinds of crazy things, and basically the architects don't really know where things are going, so they have to try things out. Because they have to try all these different things, there is not really a standard architecture; it is not yet clear what the next generation is going to be, except that it is going to be parallel. And that brings the problems of programmability, performance portability, and rapid prototyping: it is really hard to get a fast program running on any one of these architectures, let alone portable performance across architectures.

You might expect that, but it starts looking really bad even if you pick the most benign of these processors and look at what happens there. Let's pick something like an Intel Core i7, a nice quad-core at around three gigahertz, and let's run a numerical kernel that is well understood and that people have worked on for a long, long time: the fast Fourier transform. Look at the performance plot: on the x-axis we have the problem size as powers of two, on the y-axis we have performance in Gflop/s, and higher is better. You see the performance of two different libraries: one is Numerical Recipes and the other is the best code available. Numerical Recipes is a library you download from the web; it's about one page of code, a C implementation of an FFT, and if you gave a good grad student the task of implementing an FFT, they would not write better code than Numerical Recipes. Nevertheless, between Numerical Recipes and the best code there is between a 12x and a 35x performance difference. Now one could say, well, that's just a different operation count, but it turns out that these two implementations have essentially the same operation count.
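For reference: plots like this are conventionally normalized with a nominal FFT operation count rather than the operations each code actually executes (an added note; the exact normalization used on the slide is not stated in the talk):

```latex
\#\mathrm{flops}(\mathrm{FFT}_n) \;\approx\; 5\, n \log_2 n,
\qquad
\mathrm{performance\ [Gflop/s]} \;=\; \frac{5\, n \log_2 n}{\mathrm{runtime\ [ns]}}
```

Since both implementations perform essentially this many operations, the 12x to 35x gap reflects how the work is executed, not how much work there is.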
So it really is a lot of other things, and I'm going to go into what the problem is in a second. And the problem is not special to the FFT: matrix-matrix multiplication gives a factor of up to one hundred and sixty, for example, when you run the standard implementation against the best one out there. So there is something else going on, and it's true for all numerical codes. Here we see what is going on: roughly a factor of five in performance is lost through the memory hierarchy, basically through mismanagement of the caches; that's the 5x. Then there are vector instruction sets, and not using them gives away another 3x. So we lose about 15x of the 30x to 35x to single-core optimizations, and then there is multicore parallelization, which is another factor of about three. Because of all that, performance library development is a total nightmare, and this is the simplest, nicest one out of this whole big bang of architectures. Now what are you going to do to get good performance across all these platforms, if the simplest one is already like that? What Spiral is trying to do is address this question and make it possible to get high performance across these platforms.

Now, the state of the art, apart from the program generation efforts, is the current vicious cycle: whenever a new platform comes out, programmers have to go there and start fine-tuning the matrix multiplication, the FFT, then ScaLAPACK and so forth, all the basic kernels. Intel basically has an army of programmers in Russia, IBM maybe in Russia or China, who knows where, and they all play catch-up all the time, because you can never reach full performance before the next platform arrives. In 2005 there was a special issue of the Proceedings of the IEEE that showed the state of the art in automatic performance tuning. Back then, Sparsity was in there, ATLAS was in there, FFTW was in there, and also Spiral and other compiler techniques. That issue described the state of the art in automatic performance tuning, where the computer takes part of the problem and automates it, but it was mostly single-threaded, non-parallel. The question now is how you move the field from single-threaded performance tuning to multithreaded performance tuning, and we have taken a couple of steps in that direction in Spiral; the remainder of the talk is basically about that topic.

The organization is as follows: I will give you a little bit of an overview of the Spiral framework, then we go into parallelism, then I will talk a little bit about general-size libraries and show you some results that it actually works, and then we conclude. So what is Spiral? In traditional code generation, or program development, you have a library specification, you have a synthetic aperture radar book, you have an FFT book, you have journal papers and algorithms, and you have a number of very smart people who know the architecture. They start working very, very hard, and after a while they get a high-performance library, optimized for the platform they know, and whenever the platform changes they may have to go back to the library and redo it. It is real hard work, and there are very few people who can build a really, really fast performance library. With Spiral the picture is basically the same, except that the people are replaced by this red box, which is Spiral. It's basically their clone.
So basically the point here is that these people did not lose their jobs; they actually became happy, because they could work on more interesting problems than shuffling assembly instructions around into the right order. And that actually happened with industry: for a while people were afraid for their jobs, and then they finally realized it's really good for them, because they are playing catch-up all the time anyway, and now they can play catch-up on interesting problems, not on the hand-tuning of kernels. So basically Spiral replaces the human effort here, and the important thing is that the performance is comparable to hand-written libraries, because otherwise it wouldn't be any good. That is what the Spiral system we've developed over the years is doing. In a nutshell, it's a library generation system for performance-critical kernels. It initially focused on linear transforms, but recently we started branching out, and we now support many other kernels, including some linear algebra, some communication, and some image processing kernels. It supports a wide range of parallel platforms: everything started out with vectorization for SIMD vector instruction sets, then we got into threading, message passing, streaming, gate-level parallelism on FPGAs, and offloading, basically offloading to a GPU or to an FPGA.

Our research goal over all these years has been to teach the computer to write fast libraries. So instead of a human doing the work, let the computer do the work. The idea is that whenever a new platform comes out, we just want to regenerate the library; we don't want anybody to have to do any work. But every now and then something bad happens and the vendor gets a new idea, like: why don't we take a graphics card and glue it onto a CPU? That is a game changer, and when things like that happen, we as the providers of the Spiral tool have to go back and update the tool, and hopefully minor changes in the tool, not major changes, will allow generation for the new architecture. We've shown over the years that this actually works. What Spiral also can do is generate commercial-grade software, which means that over the years Intel started to use Spiral to generate parts of the Math Kernel Library and the Integrated Performance Primitives, which are highly optimized libraries that Intel provides and has people in Russia optimizing. And last year a commercial entity called SpiralGen was spun out of the university, and we are currently seeing where that will go.

So the vision behind Spiral is this: when you have a numerical problem, usually there is a C program that implements the functionality. The C program tends to be an over-specification, and because of that the compiler runs into deep problems: the compiler has to try to extract back all the information that was there, that the human knew, but that was thrown away by the program. Going from the numerical problem to the C program is a human effort; people have to go to the library, have to understand everything, and once they have written the program they press the button, the compiler takes over, and they get the executable; it's just that they don't get the performance they would like. What we're trying to do is take away the C program and say: up here you write a specification of what you want to do, and then everything is automated from the top to the bottom. That's the philosophy: basically get rid of the C program, get rid of the human in the loop. Now, what's the main idea to make it work?
The main idea is that we have an architecture space, we have an algorithm space, and we have a common abstraction that describes both architectures and algorithms, and the transformations between them. The important thing is that everything has to be in the same language, so that it can be manipulated symbolically, and what we learned is that something based on multilinear algebra, with a bit of tensor notation and a bit of other crazy things mixed in, works, so that you can describe the problem and its structure. That way we can basically parametrize the architecture space and then derive the optimized programs automatically. In some other work, which I'm not going to talk about too much, we can also flip it around and say: given the algorithm, find the architecture. There is a thread in Spiral where we actually build custom hardware architectures for problem classes; that is basically co-design.

Now, how does Spiral work? You have a problem specification as input, so you say something like: I want an FFT of size 1024, or I just want an FFT library like FFTW, or I want an FFT library like IBM's ESSL. That is the specification, and then you state the target machine. Spiral goes off, does its thing, and after a while out comes a fast executable program that can be compiled, that actually implements the function, with the right interface, on the target machine, very fast. So it basically replaces the whole programming effort from specification to executable. The basic idea is that we use a declarative representation of algorithms inside the system, and we use methods like rewriting systems to transform algorithms and map them to hardware.

So what does the face of Spiral look like? If you go to the web page, it looks something like this: for every target you go there, you say I want this DFT, say the one that is used in WiMAX, you set all the parameters, you click the button to generate code, with vectorization for example, and after two, three, four minutes you get a thousand or two thousand lines of code, and this is highly optimized code that beats the best code out there. In order to get funded, basically, our DARPA program manager had to be able to do that himself. And it worked: he used it, clicked, and he generated the fastest code out there, so we were happy. That is Spiral in a nutshell.

Now let's look a little bit into the mathematical underpinning of the system and why it works. The origin of Spiral is FFTs and matrix-vector multiplication. Basically, an FFT is a way to get the frequency content of a signal, and it can be expressed as a matrix-vector product: the product of a vector of data with a matrix that you know at compile time. So it's a matrix-vector product, and moreover that matrix has a certain structure: you can factor the matrix into a product of sparse matrices, like down here, and that brings the order n-squared operations down to order n log n operations, which gives a fast algorithm for this matrix-vector product, or for the signal transform. That's the original idea. And moreover, the structure of these sparse matrices is such that you can write them using a tensor product, or Kronecker product, notation, and you can describe basically all known FFT algorithms with a handful or two of symbols.
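To make the structure concrete, here is the standard way it is written down in the Kronecker-product notation that Spiral builds on (an added illustration; T is a diagonal matrix of twiddle factors and L a stride permutation):

```latex
y = \mathrm{DFT}_n\, x,
\qquad
\mathrm{DFT}_n = \big[\, \omega_n^{k\ell} \,\big]_{0 \le k,\ell < n},
\quad \omega_n = e^{-2\pi i / n}

\mathrm{DFT}_{km} \;=\; \big(\mathrm{DFT}_k \otimes I_m\big)\; T^{km}_m\; \big(I_k \otimes \mathrm{DFT}_m\big)\; L^{km}_k
```

The second line is the Cooley-Tukey FFT written as one such sparse factorization: two stages of smaller DFTs glued together by a diagonal scaling and a permutation.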
That was the first push towards codification of the knowledge of algorithms, and it is based on the work of Charles Van Loan in his book; I don't know the full title off the top of my head, but it is Van Loan's framework for FFTs. We then pushed it further, and one of the founders of Spiral, Jeremy Johnson, worked a lot on that representation. The idea is that you use just a few symbols to describe a huge space of algorithms, and moreover, because you can describe it all mathematically, it becomes tractable for automation. Where we are right now: we have about fifty different transforms, like the FFT, real transforms, sine and cosine transforms and so forth, and about two hundred, maybe two hundred and fifty, breakdown rules, which, for those of you who know FFTs, are things like Cooley-Tukey or Rader, for example. These rules teach Spiral what it means to compute an FFT; it is a codification of knowledge, and it codifies on the order of two hundred journal papers' worth of knowledge in the field of signal processing. Once you have the knowledge, you have a formal grasp on the problem.

The next question is: that was FFTs, and FFTs are nice, but how do you go beyond FFTs? We were stuck with FFTs for many years, and then you want to go beyond. The first observation is that a transform is a linear operator with exactly one input vector and one output vector, and because of that fact we could use multilinear algebra and the tensor product to represent the operations. If you want to go beyond that, you have to break that representation, and the question becomes how you break it without breaking too much. The idea is: first, let's drop linearity and say we just have an operator that is potentially nonlinear, and moreover it may have more than one input or more than one output. The previous language that wrote the tensor products down we called the signal processing language, SPL, and we had to extend that language into the operator language, OL, which is internally our representation of that idea. After we had done that, we had to generalize all the rewriting and all the infrastructure to the more general operator framework. Just to give you a little bit of an idea of what that meant: we had to define, in really strict mathematical terms, what it means to be an operator, what the operations and the higher-order operations are, and so forth. Moreover, we had to break the tensor product definition, because it is only defined for multilinear operators, and we defined something that looks like a tensor product and feels like a tensor product but is actually defined differently, and by doing so we could lift the framework from the space of linear operators to nonlinear ones, and we kept everything that we wanted for the representation of algorithms. It just doesn't make much sense mathematically anymore, but that's OK, because this is program generation, not mathematical research.

So that is the operator language behind it, and once you have that, you can start trying to describe different fields using the new language and see how much it is able to describe. Here we have the field of linear transforms, with the two hundred twenty rules. And now consider, for example, matrix-matrix multiplication.
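As a rough illustration of that step from SPL to OL (an added aside; the precise OL notation in the Spiral papers differs in its details):

```latex
\mathrm{SPL:}\quad y = M x, \qquad M \in \mathbb{C}^{n \times n}

\mathrm{OL:}\quad \mathrm{MMM}_{m,k,n} : \mathbb{R}^{m \times k} \times \mathbb{R}^{k \times n} \to \mathbb{R}^{m \times n},
\qquad (A, B) \mapsto A B
```

In SPL everything is a linear map applied to one input vector; in OL an object like matrix-matrix multiplication is simply an operator with two inputs and one output, and it is not required to be linear.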
It turns out that if you want to codify the idea of what it means to do matrix-matrix multiplication, you only need the following ingredients: a one-by-one matrix-matrix multiplication is a scalar multiplication, and you can cut the matrices horizontally, you can cut them vertically, or you can cut them along the common dimension. Once you have written that down, which we have basically done here, and it looks like gibberish, but it is the mathematical notation we are using, then the computer understands what it means to do blocked matrix-matrix multiplication. The research question becomes: can you take that input and generate ATLAS from it? The answer is almost yes: we have something almost like ATLAS, not the full BLAS, not everything, but in principle we can compile this down to an adaptive library of about ten thousand lines of code. The same thing here with synthetic aperture radar: we had to talk to radar experts for a year to find out what it means to do SAR, and after that was done we could codify that knowledge into a couple more rules like these, and now Spiral knows what it means to do SAR and can generate radar implementations from a high-level specification. The same thing happened with the Viterbi decoder, a decoder for convolutional codes: we just had to codify it, Spiral understands the specification, and it generates very fast decoders.

So what we see from that is that many, many kernels can be written in the operator language, but it is not clear exactly what can be done and what cannot. Moreover, these are mathematical objects; it's all very fluid, you can bend it and stretch it in all kinds of directions, and the research question is of course how far you can push it and where it breaks down, because in the limit it would become a general-purpose language, and all the power that we have would be lost, since all the knowledge about the domain would be lost. So we are doing a balancing act: make it general enough to get enough domains in, but don't make it so general that it breaks down.

Now, once you have the operator language formalized, you can have a special-purpose compiler that takes the operators down to code. Here we show it going down to sequential C code, but in reality it of course also takes it down to pthreads or to MPI, so there is a very general special-purpose compiler that can take these expressions down to all kinds of target platforms. We also have a student working on further targets right now, and we have a couple of preliminary results, but it's ongoing. We had looked into one of those directions a couple of years back, but the student moved on at some point and we stopped, because the payoff on general-purpose architectures didn't look good enough at the time.

So the basic idea is this: take the example here, which is a tensor product of an identity matrix with something else. For those of you who see what that matrix means, it becomes a block-diagonal matrix, so that thing becomes a loop of the same kernel over different data sets that are contiguous in memory. That basic observation, that a formula has a meaning as a program, drives the whole special-purpose compiler, and that is one of the key things that lets us take a formula representation all the way down to high-performance code.
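A minimal sketch of that observation, with a made-up two-by-two kernel; this is an added illustration, not Spiral-generated code:

```c
#include <stdio.h>

/* A fixed 2x2 kernel A (example values). */
static void kernel_A(const double *x, double *y)
{
    y[0] = 1.0 * x[0] + 2.0 * x[1];
    y[1] = 3.0 * x[0] + 4.0 * x[1];
}

/* y = (I_4 (x) A) x : the tensor product with an identity is a
 * block-diagonal matrix, i.e., a loop of the same kernel over
 * contiguous blocks of the data. */
static void i4_tensor_A(const double *x, double *y)
{
    for (int i = 0; i < 4; i++)
        kernel_A(x + 2 * i, y + 2 * i);
}

int main(void)
{
    double x[8] = {1, 2, 3, 4, 5, 6, 7, 8}, y[8];
    i4_tensor_A(x, y);
    for (int i = 0; i < 8; i++) printf("%g ", y[i]);
    printf("\n");
    return 0;
}
```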
Now, the whole flow is: you start out with the functionality, you go through the operator language and a couple more internal representations, until you finally end up with threading, vector intrinsics and so forth, and then you use a standard compiler. That is the Spiral box. And the steps here are things a standard compiler cannot do, because at that level the problem is way too hard: a standard compiler does not understand what it means to do an FFT, or a transposition, or whatever, so we do it all at a high level, symbolically, and that is where the platform knowledge goes in. Basically what we are doing is optimization at a high level of abstraction, and so the compiler problems that usually show up go away, and that allows full automation for realistic problems.

Yes, I can give a little bit of intuition; it is probably a talk by itself. Basically there is a data-flow-like structural language, and then there is a loop-based language; going from one to the other you make loop structures explicit, which means you cannot rearrange the data flow anymore, but you can merge permutations into the loops. In the pure data-flow representation you cannot deal with readdressing properly, which is a very important part of the optimization. The parallelization happens at the data-flow level, where you understand the platform and map things properly to the architecture; I will get there in a second. The loop level then does the cleaning up. The different dimensions, vectorization, threading and so on, combine essentially as a cross product; the system is made to be as orthogonal as possible. Then comes code with intrinsics, and that's it. Do you need a compiler or an operating system? Since this is performance computing on dedicated architectures, people don't care much about the operating system: you basically pin down the threads on the different cores, and that's it.

OK, so the question you just raised brings us right to the next topic: parallelism. There are many, many different types of parallelism: multicore and multithreading, shared memory, SIMD vectors, streaming, offloading, graphics processors, gate-level parallelism or partitioning on FPGAs. Typically, in your tool flow you have a compiler very far down which tries to extract the parallelism and does as good a job as it can, and the compiler people have done a tremendous job, but it is a very hard problem that does not look like it will be completely solved. So what we are doing is move that up to the algorithm optimization level, where we still have the full information about the algorithm, and one methodology covers all of these forms of parallelism; they are just different instantiations. How does it work? Take shared memory, for example. We are in SPL, which is the one-input, one-output version of the operator language, and we have a construct that we've seen before, a tensor product of an identity with a kernel, which means a loop over contiguous data. Now, if you see I_p tensor A, and you know that p is the number of processors, then it is clear that each processor can run A on its own block, it is going to be load balanced unless A is crazy, and with p equal to the number of processors you utilize all the resources; it is going to be nice. Here is the picture for that: the red, yellow, green, and blue blocks are the different processors, each working on its own piece of data. So that's nice.
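Continuing the earlier sketch, here is a minimal illustration of how that construct maps to shared memory, using OpenMP for brevity; generated code may instead use explicitly pinned pthreads, and the kernel is again a made-up example:

```c
#include <stdio.h>

#define P 4                                    /* number of processors (assumed) */

static void kernel_A(const double *x, double *y)   /* toy 2x2 kernel */
{
    y[0] = x[0] + x[1];
    y[1] = x[0] - x[1];
}

int main(void)
{
    double x[2 * P], y[2 * P];
    for (int i = 0; i < 2 * P; i++) x[i] = i;

    /* y = (I_P (x) A) x : one contiguous block per processor,
     * embarrassingly parallel and load balanced.               */
    #pragma omp parallel for
    for (int i = 0; i < P; i++)
        kernel_A(x + 2 * i, y + 2 * i);

    for (int i = 0; i < 2 * P; i++) printf("%g ", y[i]);
    printf("\n");
    return 0;
}
```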
Now there is a counterexample to that, which would be something like a matrix transposition, in our lingo a stride permutation: you take a four-by-four matrix, stored linearized in memory, and transpose it. If you have that, and you know the cache line length, what you see is that you get false sharing: the red and the blue processors step on each other's toes, a lot of cache coherence protocol events happen, and the thing runs dramatically slower than before. Something like that is really bad. But moreover, we see that we can just look at a formula, without ever writing a program, and know that this one is going to be good and that one is going to be bad. And what holds true for cache lines also holds true for MPI messages and so forth; it is really only a question of how you interpret the formula, of giving meaning to the formula. The idea now is to get to a formula that consists only of the good parts and has none of the bad ones, or comes as close as possible to that ideal.

Now, how do you take a formula and translate it? By rewriting rules. Every rewriting rule basically encodes the knowledge of a compiler transformation: it gives a syntactic pattern, and it provides the knowledge that the transformation is legal, in one line. For example: whenever you want to do a matrix transposition, you can block it; the transposition can be blocked, and you can block it in such a way that the coarse-grain blocks run on different processors and the fine-grain blocks are compatible with your cache line size. That is the idea an expert would explain to you: I would do it like that, and that makes a lot of sense. We have a way to take that idea and write it down formally, mathematically, like this, and so forth. We have many, many rules like that; this is Spiral's transformation rule base. The important thing here are the tags: they say I am on a shared-memory machine with p processors and cache line length mu, so the machine parameters go into the formula; you see that the mu shows up here and the p shows up there. What that means is that well-known linear algebra identities get a new meaning: a factorization of a permutation now means blocking of a matrix transposition. That way we can codify the knowledge of how to parallelize.
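Two schematic examples of such tagged rules, for a shared-memory machine with p processors and cache line length mu (an added illustration; the notation is simplified and the actual rule set is richer):

```latex
I_{pm} \otimes A_n \;\longrightarrow\; I_p \otimes_{\parallel} \big( I_m \otimes A_n \big)

A_m \otimes I_{n\mu} \;\longrightarrow\; \big( A_m \otimes I_n \big) \otimes I_{\mu}
```

The first rule splits a loop into p load-balanced chunks, one per processor; the second groups the data so that it moves only in whole cache lines of length mu, which avoids false sharing. Both are ordinary tensor-product identities; the tags are what give them this architectural meaning.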
So we have the two hundred twenty rules for the transforms, we have these architecture rules, and we can throw them all into one big bucket, have a rewriting system work on it, and say: please find me a good solution. The way that works is: you start out and say we want an FFT of a certain size, the rule system starts working, uses the FFT rules and the architecture rules we have just seen, and after it is done it comes up with this formula. The formula is not very readable, but it is only an internal representation of the Spiral system. You can now start color-coding it: this part here has a tensor with an identity of the cache line length on the right side, so it is just shipping around cache lines of the right length, which is good; there is no false sharing here and here. And the blue part has an I_p tensor in it, so it is embarrassingly parallel, nicely load balanced, and good. So by giving meaning to the formula, you can formally derive an algorithm that is good in a very simple machine model. Now you have many of those algorithms that are all provably good in that simple model, and on top of that you can start doing empirical tuning and search the space of different implementations; there might be forty or fifty of them. The same idea applies to matrix-matrix multiplication: you start out saying I want a matrix-matrix multiplication, I have this many processors and this cache line length, the system goes off and ends up with this formula, which now is a parallelized matrix-matrix multiplication, and again we have red and blue: red is data exchange with no false sharing, and blue is embarrassingly parallel computation of smaller matrices. That formula then just has to be translated by the backend compiler into OpenMP or pthreads or whatever else; that is standard engineering to make it nice and make it work.

So that was shared memory. We do the same thing for message passing with MPI, for the Cell, as you may have guessed, and for vectorization, with SSE for example; all these architectures fall into this framework. For GPUs, here is a formula for the 7900 series, where we had some pretty cool results using OpenCL and OpenGL and Cg and what not. And here is Verilog for FPGAs. It all works, it is rigorous, it is correct by construction as long as the rule system is implemented properly, and it overcomes the compiler limitations. It also allows one more interesting thing: we can do pre-silicon optimization. We get a document that describes a new instruction set, without ever having seen the architecture or a compiler; we can write our own little emulation library that just gives meaning to the instructions, then tell Spiral to use these instructions according to what we have just seen, and then say: please minimize the operation count, or minimize a weighted operation count according to some model. Spiral then produces code like this; that is a 64-point FFT for the upcoming vector instruction set, and the performance line here is blanked out because we cannot show any numbers, but we have the code ready. For that next-generation Intel processor it took us basically one transatlantic flight to build the emulator, then a couple of days of debugging, and then we had very fast code very early.
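A sketch of what such an instruction emulation library can look like: plain C functions that give meaning to instructions that do not exist yet, so that generated code can be run and validated. The instruction names and semantics below are invented for illustration and are not a real ISA:

```c
#include <stdio.h>

/* Mock 4-way vector register and two "instructions" emulated in plain C. */
typedef struct { float f[4]; } vec4;

static vec4 vadd_ps4(vec4 a, vec4 b)              /* hypothetical 4-way add */
{
    vec4 r;
    for (int i = 0; i < 4; i++) r.f[i] = a.f[i] + b.f[i];
    return r;
}

static vec4 vshuf_ps4(vec4 a, vec4 b, int imm)    /* hypothetical shuffle */
{
    vec4 r;
    r.f[0] = a.f[ imm       & 3];
    r.f[1] = a.f[(imm >> 2) & 3];
    r.f[2] = b.f[(imm >> 4) & 3];
    r.f[3] = b.f[(imm >> 6) & 3];
    return r;
}

int main(void)
{
    vec4 a = {{1, 2, 3, 4}}, b = {{10, 20, 30, 40}};
    vec4 s = vadd_ps4(a, b);
    vec4 p = vshuf_ps4(a, b, 0xE4);   /* 0xE4 selects a[0], a[1], b[2], b[3] */
    printf("%g %g %g %g | %g %g %g %g\n",
           s.f[0], s.f[1], s.f[2], s.f[3], p.f[0], p.f[1], p.f[2], p.f[3]);
    return 0;
}
```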
Now to the last part: general-size libraries. The problem statement is the following. If you want a program that runs not for a single size, like an FFT of size 1024, but for all sizes, that is a very different problem. In some domains, for example if you build the physical layer of a transmitter in a software-defined radio, you only have to support a handful of sizes, say 1024, 768, and 64, and you want the smallest possible code for exactly those sizes; the original Spiral was targeting that kind of situation, and we made the problem size a compile-time parameter. But when you enter the scientific computing space, that assumption doesn't hold anymore, because people there want to link against a library and then run with the same code regardless of what the sizes turn out to be. It turns out this is an extremely complicated problem, and my colleague Yevgen's PhD thesis was basically how to solve it.

Here I'm going to give you the two-or-three-slide short version of what he did in his thesis, in collaboration with the whole group, obviously. The input is: I want to do a DFT, a discrete Fourier transform, of unknown size; I want to use this FFT, which is written like that; I want vectorization; I want threading; and to the outside world I want it to look like FFTW. After the system has worked for a while, the output is an optimized library, for example ten thousand lines of C++; it could be fifty thousand lines of C, or Java, or C#, it doesn't really matter which language. It is general, so the problem size is not known at compile time; it is vectorized and multithreaded; and it has a runtime adaptation mechanism, like a search, that tries things. And the performance is competitive with hand-written libraries, Intel MKL for example. That's the picture: you take the formal specification, put it into the library generator, and out comes a high-performance library.

Once you can do that for an FFT library, it opens up quite a big space, because you can for example say: take the Cooley-Tukey FFT and make it look like FFTW, so you generate something like FFTW, or a subset of it, from a formal specification. You can also say: make it look like the Intel MKL interface, or I really want it to look like FFTPACK, because that is just the interface I like. And you can think of combinations like an FFTW-style library on the Cell, or an MKL-style library on a GPU or on a Blue Gene, things that nobody would ever build by hand. Now you can also use algorithms that Markus, my collaborator, developed in his other research thread, which is on signal processing: he developed new transform algorithms, so you can build a discrete cosine transform library that looks just like the obvious cosine transform library but is a factor of two to five faster. We can do the same thing for things like FIR filters, or for matrix-matrix multiplication; we are not at the full ATLAS, obviously, but we can do something that looks and feels a little bit like ATLAS or MKL, and so forth. Yevgen's thesis covers the FFT and FFTW-like signal processing libraries, and then Frédéric de Mesmay, the other student, extended this to matrix-matrix multiplication and other kernels.

The idea is that you still have the breakdown-rule database we had before, and you have the platform knowledge we have seen before. Now you do something called recursion step closure, which finds the recursive algorithm structure. Once you know the recursion structure, you have base cases and recursion steps, which give you recursive functions, codelets in FFTW lingo, and then you use the old fixed-size Spiral to generate actual code for the recursive functions, which are mutually recursive, and for the base cases, which are just straight-line code. Then you do something called hot/cold partitioning, which is an analysis of which parameters are compile-time, which are runtime, and which are initialization-time. Then you tie everything together, it gives you the interface, and off it goes. If you go to the Spiral web page, there are a couple of libraries like that available for download, and you can see what these ten thousand lines look like.
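A sketch of the hot/cold split such a generated library ends up with: size-dependent setup happens once at initialization time, and the hot call takes only the data. The names and the trivial body are invented for illustration; this is not the actual interface of a Spiral-generated library:

```c
#include <stdio.h>
#include <stdlib.h>
#include <complex.h>

typedef struct {
    int n;                      /* problem size: fixed at initialization (cold) */
    double complex *twiddles;   /* precomputed constants (cold)                 */
} dft_plan;

/* Cold: choose a recursion strategy (possibly by search), precompute tables. */
dft_plan *dft_plan_create(int n)
{
    dft_plan *p = malloc(sizeof *p);
    p->n = n;
    p->twiddles = calloc((size_t)n, sizeof *p->twiddles);
    /* ... fill twiddles, pick among the generated recursion steps ... */
    return p;
}

/* Hot: only runtime parameters (the data pointers) are passed per call. */
void dft_execute(const dft_plan *p, const double complex *in, double complex *out)
{
    for (int i = 0; i < p->n; i++)
        out[i] = in[i];         /* placeholder for the actual recursive computation */
}

int main(void)
{
    dft_plan *p = dft_plan_create(8);
    double complex x[8] = {1, 2, 3, 4, 5, 6, 7, 8}, y[8];
    dft_execute(p, x, y);
    printf("%g\n", creal(y[3]));
    free(p->twiddles);
    free(p);
    return 0;
}
```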
Now let's go to the next slide and see what that means. You start out with the DFT and apply the Cooley-Tukey FFT algorithm, and you notice the curly braces here: the curly braces mark what becomes a recursive function in the framework. You see that the curly braces get propagated around, and what the system figures out automatically is, if you know FFTW, that in order to implement something like FFTW you cannot just use plain smaller DFTs: you need smaller DFTs with strided input, strided output, and twiddle-factor scaling. That is something Matteo and Steven had to figure out by themselves, and now we have a formal way of figuring it out automatically. The sum here marks the recursive pattern, which means divide and conquer into a couple of FFTs and then another couple of FFTs; that is what it finds. The curly braces are problem specifications for fixed-size Spiral, which basically means: generate code for those. I don't really have time to explain it in more detail, but that is the idea: you start from a formal specification, feed it in, and a computer algebra system discovers all the things that people previously had to figure out by hard thinking.

A codelet is a small piece of code, like an FFT of size sixty-four, that kind of thing; FFTW coined the term because it is not a full FFT code, just a small piece of code. No, no, these are fully resolved, so in the rewriting the recursion terminates. And adding a breakdown rule is a PhD-thesis kind of thing; it is very, very complicated. Yes, the user would be a PhD student on the Spiral team. It is so complicated that it is essentially impossible for somebody who is not permanently immersed in it to understand what is going on, whether you go top-down or bottom-up; you really have to understand both the mathematics and the implementation level in great detail before you can do anything.

This is just a picture: if you do a scalar FFT, you get four recursive functions, and that is a picture of their definitions; basically they are a little bit like codelets, but with recursion, so in FFTW lingo, that is what you see here. The last thing: if you do it for a cosine transform, vectorized, it blows up into many more recursion steps, and these are all the, probably unreadable, problem specifications that now go into Spiral to build the base cases; these are the specifications of the codelets. So this is Spiral's approach: from a formal specification it derives the recursion steps and the whole library infrastructure automatically, something that before took people many, many years of manual work.

We also got access to Blue Gene L and Blue Gene P, and they have special SIMD hardware; we could just teach Spiral to make use of it, and it worked, so back then it was a very short effort of just going there and understanding the architecture. And there is a student who worked on the Cell; he is about to graduate.
What we found there is that we have two plots: one is basically as good as it gets when you stay on-chip and allow for custom data formats, which is what you would do when the FFT is a kernel inside a larger computation and the data is already distributed; there the performance goes up even further. And here we compare against other codes out there in the standard setting, and what you see is that we go up to about twenty gigaflop/s and basically flatten out at around twenty gigaflop/s over that range of sizes; here, for some reason, it starts shooting up, and here it runs out of memory and doesn't work anymore, so the student is right now trying to reproduce that and fix it; it is not yet done. But it shows you: you describe the architecture, and Spiral can handle it. We also have some results on FPGAs, where we are as good as the Xilinx LogiCORE, basically what the vendor did by hand, but automatically.

One important thing: once you have a library generator, you can play a game. You can ask: how much performance do I get for one thousand lines of code? How much for three thousand lines of code? Let's go to ten thousand, let's try thirty thousand, and so forth. So that is yet another knob: you can really trade off performance against code size. No, you just run it: basically you say how much code you can afford, and then you try out what happens. To give you a sense of scale, FFTW is on the order of two hundred thousand lines of code. We also have some results for matrix-matrix multiplication, where we are sometimes fifteen to twenty percent slower than MKL, sometimes on par with the Goto library, and once we know that we are doing a rank-k update, we actually outperform them by quite a bit. So that is the space: these are kernels that are self-adapting and automated, almost like ATLAS, but only for smaller sizes at this point.

The last story I am going to tell: we were challenged with the following. There is an implementation, which got a best paper award in 2007, that runs synthetic aperture radar on a Cell with eight cores and twenty-five gigabytes per second of memory bandwidth. Why don't you get the same performance on four cores with a fraction of that memory bandwidth? So we said OK, let's try it, and it worked. It translates into extremely aggressive optimization: the full code generation system was running for twenty-four hours straight, produced about a megabyte of code, and then was outperforming the experts.

So where are we currently headed? One direction is new applications and algorithms: linear algebra, image processing, software-defined radio; we just recently started working on coding and on radar processing, so heavy-duty processing kernels. On the platform side: whatever is new and hot and parallel basically works for me. And there are other directions that I have left out of this talk. So let me come to the summary. In summary, Spiral is a successful approach to automating the development of performance libraries. It is commercially used and proven, and we are currently working on the commercial entity, SpiralGen, that spins out the technology. What it does is take something like a DFT of size sixty-four and produce code directly with intrinsics. And of course the key ideas are the domain-specific, declarative
representation based on multilinear algebra and tensors, the operator language, which goes well beyond linear transforms if you extend it a little. And the difficult optimizations that are usually done at the compiler level we do at a high level of abstraction, through rewriting, instead of through very expensive analysis. In case you want to learn a little more, there is spiral.net for the academic project, and there is the web page of SpiralGen for the commercial entity. Thank you very much.

Together with a colleague at the university we are working on exactly that topic; there is a PhD student working on it, and we have a couple of first steps; that is how we are going to go forward. The code actually was not available for a long time, and the reason for that was not really the company in the first place; that has changed now, of course. The problem is that the system is really, really complicated and extremely fast-moving, and because of that somebody needs something like a year of training before they can do anything with it; think about what that means for documentation, which is not there. So for your question: if you just try to install it and generate your FFT, it doesn't work that way; you just cannot do that. Because of that we decided to put only the older version on the web, and only later we figured out that we actually needed to go commercial, because many of the things the companies we worked with wanted us to do you simply cannot do in a university, since the students have to graduate and cannot be engineers doing production programming. But no, that is actually too harsh a statement: an older version of Spiral, Spiral 3.1, which can do scalar code and loop code and which has the SPL compiler and part of the rule database, should still be available. I think; maybe it is not on the web anymore; I would really have to double-check; maybe it was taken down recently by one of my co-founders. Thank you very much.

Yes, so basically we talk about that quite a bit, and what I personally think is that you want to have a backend code generator, so we are looking into that. And I think, for example, I would never do Spiral for domains where the language of tensors has no place; there you have to use other tools. That is my personal view on it. Thank you very much.