[00:00:05] >> OK, so, good morning everybody, and especially for the external visitors, welcome to Georgia Tech. I'm an assistant professor in ECE here, and I'm primarily a computer architect. Over the past couple of years we've been doing work on DNN and hardware design for enabling pervasive, general-purpose AI. Of course, ever since we invented computers, creating artificial intelligence has kind of been the dream, and thankfully, due to deep learning, we have actually taken massive strides in that direction. Practitioners like Andrew Ng actually call it the new electricity, because there is probably no industry that will not be affected by it. We've already seen it deployed across vision, imaging, text-to-speech, and games, and the number of industries and scientific applications using it continues to grow. Just to give you a very quick primer on what deep learning is, for those of you who are not very familiar: the underlying mechanism is essentially a deep neural network. Akin to biological neural networks, you have compute units known as neurons, they are connected by what are known as synapses, and you have a bunch of layers — because you have a lot of layers, that's why it's called a deep network. [00:02:02] Each synapse is associated with a weight, each neuron is associated with some input, and each output is essentially a weighted sum of all of the weights and inputs, with some activation function applied. That's, at a very high level, the computation that happens within these networks. The deep learning landscape essentially has two key components: the training phase and the inference phase. Inference is something we actually see happening on most of our phones today. Let's take an image and send it through the network — in this case it has a bunch of conv layers that extract certain features from the image, then some layers that summarize these features — and if you've trained your network well, it will identify that this is the Klaus building we are in today. So how do you train this network? You actually start with tons of labeled images, say of the Klaus building in this example, you again send them through a network, and you're trying to design this network: you start with an initial DNN model, which is the network topology and weights, you compute how much error you're getting, you have an ML practitioner who's kind of guiding this process and updating the topology as you go along, and eventually you hit some error bound, you're happy with the network, and this is what then gets deployed for inference. Within this landscape, in terms of computing platforms, training involves a lot of compute — it's often done on large-scale HPC clusters, often involving a lot of GPUs — and inference happens mostly on the edge; the idea is to have it occur close to your edge devices, like your cell phone or self-driving cars, often on custom accelerators, and I'll get into why we make these decisions later in the talk. [00:03:46] So there are a few challenges in designing and deploying DNNs if you really want them to be pervasive. First, we really need to design the DNN architecture, i.e., the DNN model, which is part of the training phase.
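As a rough illustration of the weighted-sum computation described above, here is a minimal NumPy sketch (illustrative only, not code from the talk) of what a single layer computes: each output neuron takes a weighted sum of its inputs and applies an activation function.

```python
import numpy as np

def relu(z):
    # A common activation function: max(0, z) applied elementwise.
    return np.maximum(0.0, z)

def layer_forward(x, W, b):
    # Each output neuron computes a weighted sum of all inputs
    # (one row of W holds that neuron's synaptic weights),
    # then applies the activation function.
    return relu(W @ x + b)

# Tiny example: a layer with 4 inputs and 3 output neurons.
rng = np.random.default_rng(0)
x = rng.standard_normal(4)          # inputs
W = rng.standard_normal((3, 4))     # synaptic weights
b = np.zeros(3)                     # biases
print(layer_forward(x, W, b))
```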
Then once we get this model, we really need to map it onto our hardware substrate — that's where the compiler and mapper come in — and then you actually have the accelerator microarchitecture, and these collectively determine what your energy and runtime will be, which is what you really want to optimize. My group has been looking at all aspects of this pipeline, and in my talk I'll actually cover three projects, one in each of these spaces, just to give you a quick glimpse. For DNN architecture, I'll talk about a project we recently did called GeneSys; for mapping, I'll describe a tool we have called MAESTRO; and for the accelerator microarchitecture, a design we have called MAERI. As I said, I'll give you a quick glimpse, but I'm available through the day for more questions and a deeper dive. So let's start with GeneSys. [00:04:47] Let's come back to the training framework I showed here and ask: what if I don't have labeled datasets? There may be a lot of problems where I just do not have data available in the form of labeled data. I may not have a DNN model to start with — I don't even know what the neural network topology should be for this problem. Yes, for images I know it should maybe be convolutions, but if it's some unknown problem I don't even know what it should be. Maybe I don't even have an ML practitioner available who can help me design it, because I'm trying to do this autonomously, say on a self-driving car, and I don't have an HPC cluster — I really want to do training on the edge. If any or all of these happen, deep learning as it is today is not really viable for continuous learning. So what we've been looking at is trying to do continuous learning at the edge in the absence of, as I said, one or more of these components. Our template for continuous learning looks something like this. We have some environment — let's say there's an agent that's learning how to play the game Super Mario Brothers. I have my inference agent; this is the neural network that's actually going to run to play this game. It takes some action — let's say it asks Mario to jump — then it gets some reward, which is essentially the score from the game; Mario moves forward, gets some other reward, and this keeps happening. After a while the accumulated rewards are sent to a learning agent, which basically analyzes how well this network is doing, updates the topology and weights, and sends them over to the inference agent. This now-updated neural network topology plays the game again, sends another set of accumulated rewards, gets an updated topology and weights, and as you can see the network becomes more and more complex. So what we're trying to do here is learn how to improve on one task by making the network better and better, and at the same time also learn multiple tasks — you could employ the same scenario for any number of games or environments. [00:06:46] This of course needs to be continuously interacting with the environment, and the learning agent needs to be robust, because it needs to be able to learn multiple tasks. So the question is: how do you design such a learning engine — again, repeating the fact that you are operating in a domain where you just don't have data or people who know exactly what the neural network should be?
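The loop below is a structural sketch of this continuous-learning template. It is not the GeneSys implementation: the environment and agent objects, and their reset/step/act/evolve methods, are hypothetical placeholders assumed for illustration.

```python
# Sketch of the continuous-learning loop described above (illustrative only).
# Assumes an environment with reset()/step(), an inference agent with act(),
# and a learning agent with evolve() — all hypothetical interfaces.

def continuous_learning(env, inference_agent, learning_agent,
                        episodes_per_update=10, num_updates=100):
    for _ in range(num_updates):
        accumulated_rewards = []
        for _ in range(episodes_per_update):
            obs = env.reset()
            total_reward, done = 0.0, False
            while not done:
                action = inference_agent.act(obs)      # run the current network
                obs, reward, done = env.step(action)   # e.g., Mario moves, score changes
                total_reward += reward
            accumulated_rewards.append(total_reward)
        # The learning agent analyzes how well the network did and sends back
        # an updated topology and weights for the inference agent to use.
        inference_agent = learning_agent.evolve(inference_agent,
                                                accumulated_rewards)
    return inference_agent
```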
[00:07:04] There are a few approaches to this. What we looked at, and what we converged on, was neuroevolutionary algorithms, so let me give you a very quick glimpse of how these work. The idea is that you start with a very simple network — just two layers, inputs and outputs — and in fact you start with N of these: akin to biological evolution, you actually have a population of networks, and they all try to interact with the environment. The neural network here, in terms of terminology, is expressed as a graph — the genes are the nodes and edges of the graph, and the genome is the entire neural network; this is how evolutionary algorithms define genes and genomes in this context. So you start with all of these networks, they all interact with the environment, play the game, get some kind of score, this is translated into a fitness, and the best networks — the ones that actually did better — are sent on to the next generation through an evolution phase. After the evolution phase the networks become more complex: you choose the genes from the networks that did better, you send them forward, these again play the game, and this continues until you actually get a final, more complex network that is trained for the task you had at hand. That's the basic idea, and internally these are just running genetic algorithms, or some variants of genetic algorithms: you have this parent pool of genomes, which are your networks, you choose the fittest ones, and you send them through biological analogs such as crossover and mutation. This is essentially analogous to the training step — you're adding or removing layers, adding or removing neurons, updating weights — all of that happens through this phase, and you eventually get to the final design. [00:08:49] As a systems person, there are a lot of interesting properties that these algorithms give you. At the algorithmic level, of course, they are very robust, and the reason is they can operate in any environment and they don't need the traditional training mechanism I talked about. In this example here, [00:09:09] a neuroevolutionary algorithm is learning how this lunar lander should land between these two flagpoles, and this is the fitness that keeps increasing; as you can see, over time it actually learns how to land completely in between them. And you can just change the fitness function and say, OK, now I want this walker to learn how to walk, and suddenly the same design will work — all you had to change was this one parameter. This is of course a more complex problem, so it will be slower, but again the advantage is that there's very little manual tuning required, so for somebody who's not an ML practitioner or an expert, you can deploy these in the field and they will learn; it's just a trade-off on how much time they'll take. [00:09:51] And from the systems perspective these are actually very interesting algorithms: they have a lot of parallelism. The genomes — the networks that you're running — can all run in parallel, and in fact even the evolution step can run in parallel, because genes are evolved independently.
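To make the evolutionary loop concrete, here is a compact, self-contained sketch of one generation — fitness evaluation, selection, crossover, and mutation. It is a simplification assumed for illustration: the genome is just a fixed-length list of connection weights and the fitness function is a toy stand-in for "play the game and return the score"; a real NEAT-style algorithm would also grow the topology itself.

```python
import random

# Illustrative sketch of a genetic-algorithm generation (not the GeneSys code).
# A "genome" here is just a list of connection-weight genes.

def evaluate_fitness(genome):
    # Toy stand-in for "interact with the environment and get a score":
    # reward genomes whose weights are close to some target behavior.
    target = [0.5] * len(genome)
    return -sum((w - t) ** 2 for w, t in zip(genome, target))

def next_generation(population, keep_frac=0.25, mutation_rate=0.2):
    # 1. Fitness: every genome is scored against the environment.
    scored = sorted(population, key=evaluate_fitness, reverse=True)
    # 2. Selection: only the fittest genomes become parents.
    parents = scored[:max(2, int(len(scored) * keep_frac))]
    # 3. Crossover and mutation: children mix genes from two parents,
    #    then individual genes are perturbed.
    children = []
    while len(children) < len(population):
        a, b = random.sample(parents, 2)
        child = [random.choice(pair) for pair in zip(a, b)]        # crossover
        child = [w + random.gauss(0, 0.3) if random.random() < mutation_rate
                 else w for w in child]                            # mutation
        children.append(child)
    return children

population = [[random.uniform(-1, 1) for _ in range(6)] for _ in range(20)]
for generation in range(30):
    population = next_generation(population)
print(max(evaluate_fitness(g) for g in population))  # fitness keeps improving
```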
Each gene can go through crossover, mutation, and so on, and they don't necessarily have to communicate a lot — unlike traditional backpropagation, if you're familiar with it, where there's a lot of data exchange. There's no backpropagation, no gradient calculation or storage, a very low memory footprint because all you're storing is the genomes of the current generation, and very simple, hardware-friendly ops. Again, as a computer architect this is interesting, because all you need is MACs for the inference, and the crossover and mutation steps are very simple, so it's actually very viable for continuous learning. [00:10:37] Using this intuition, we went ahead and designed an SoC for such a system: it has an evolution engine, it has an inference engine, and it has a CPU to perform the interaction with the environment. This is a paper we just presented this week, so more details are there, and I'd love to talk to you about it in more detail. We saw that the system we built, called GeneSys (for "gene system"), has orders of magnitude better energy efficiency than CPUs and GPUs. OK, so that was on the DNN architecture side of things. Now that I have a DNN architecture, how do I actually deploy it? This is where I'll talk about our next work, called MAESTRO, which is on the mapping end of things. Again, I mentioned that you deploy these DNNs today on custom accelerators. Why do you need custom accelerators? It turns out that DNNs today are very complex: they have millions of parameters, as you can see with some of the recent networks, and that leads to billions of computations, so you really need a lot of compute. Naturally, all of this data doesn't fit on chip, so there's a lot of data movement between memory and the chip, and as you all know, moving data is very expensive. [00:11:51] Moving data from off-chip memory to the chip is almost 100 to 500x more expensive in terms of energy than local movement on the chip, so you really need to reduce that. Keeping these two goals in mind — and at a very high level this is also what makes CPUs and GPUs inefficient — people have been developing custom DNN accelerators. This is a space that is very, very active; every three days you will see a company announcing its own custom accelerator. The broad idea, from a hundred-foot view, for most of these accelerators is the following — the Google TPU, the Xilinx designs, they all look something like this: you spread your computation across hundreds of PEs, so you have a lot of compute spread across hundreds of tiny processing elements, and to reduce data movement you try to reuse data within the array — you try to avoid going back and forth to memory and instead reuse data within the array via local messages and local scratchpads. [00:12:53] But the problem is, now you have this very large neural network that you need to map onto your accelerator hardware — this could be an ASIC, it could be an FPGA, or it could be an HPC system, but all of these have arrays of PEs.
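The following back-of-envelope sketch shows why on-chip reuse matters so much. The per-access energy numbers are illustrative assumptions only — they are chosen simply to respect the roughly 100–500x off-chip versus local gap mentioned above, not measured values for any particular chip.

```python
# Back-of-envelope sketch of why reuse matters. The per-access energies below
# are illustrative assumptions (they only reflect the ~100-500x off-chip vs.
# local gap mentioned in the talk), not measured numbers.

ENERGY_PJ = {
    "mac": 1.0,         # one multiply-accumulate
    "local_sram": 2.0,  # read/write a word in a local scratchpad
    "dram": 400.0,      # fetch a word from off-chip DRAM
}

def layer_energy_uj(num_macs, words_from_dram, words_from_sram):
    pj = (num_macs * ENERGY_PJ["mac"]
          + words_from_dram * ENERGY_PJ["dram"]
          + words_from_sram * ENERGY_PJ["local_sram"])
    return pj / 1e6

# Same layer, two mappings: one re-fetches operands from DRAM for every MAC,
# the other fetches each operand once and then reuses it from the scratchpad.
macs = 100_000_000
no_reuse   = layer_energy_uj(macs, words_from_dram=3 * macs, words_from_sram=0)
with_reuse = layer_energy_uj(macs, words_from_dram=2_000_000,
                             words_from_sram=3 * macs)
print(f"no reuse:   {no_reuse:12.1f} uJ")
print(f"with reuse: {with_reuse:12.1f} uJ")
```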
The problem is that you have a huge computation space — the convolution loop nest is 7-dimensional and there's a 4-dimensional reuse space — so there are lots of opportunities for data reuse: you could reuse data temporally, you could reuse it spatially, or you could reuse it spatio-temporally, i.e., over space and time. This gives you millions of non-trivial mappings, and the energy and performance benefits depend heavily on your neural network, on your hardware, and on your mapping strategy. So how do you explore all of these possibilities? To do this we've built a tool called MAESTRO, which is an analytical cost-benefit model; this is something we did in collaboration with NVIDIA. The idea is you take your DNN description, you take [00:13:45] an approximate description of your hardware and a mapping strategy, and you send them to the tool. It does a bunch of analysis through analytical equations and gives you costs in terms of buffer counts, interconnect bandwidths, and so on, along with throughput estimates, runtime estimates, roofline estimates, and some energy estimates. You can then use this to go design your hardware, or to change your neural network, or in your compiler to determine the best mapping — you can have different use cases for it. [00:14:19] The key novel component in this tool is how you describe data flows. I mentioned that there are a lot of ways of partitioning the whole neural network's computation and reuse space, so how do you actually define and describe this concisely? We've come up with a set of pragmas that you can use to describe these data flows. Here's a very simple example: if I have two loops, I add a couple of pragmas — these are actually very similar to OpenMP pragmas, if you're familiar with those. You can see that I want to spatially map this iterator S and I want to temporally map this iterator X; the parameters are the mapping size and the offset. Temporal map means you map the same loop values to each PE, while spatial map means you map different loop values to each PE — and again, from a compiler perspective, this is similar to replication and block-cyclic distribution, if you're familiar with those terms. So the idea is that if I have 5 PEs in my system, X gets mapped in this manner — this has a mapping size of one — and S gets mapped in this manner with a mapping size of two, so two values get mapped to each PE with an offset of two; over time X will increment, and so will S. And if I had an offset of one, you can see that values 0 and 1 would go here and 1 and 2 would go here, so you can express things like sliding-window behaviors that are common in convolutions. Using a set of pragmas — this was just two pragmas, and we have a larger set — you can essentially describe these loops and then run an analysis that tells you how much reuse you're getting and do a full design-space exploration. So this is what the tool takes as input: a hardware resource description, the DNN description, and the data flow described using these pragmas. It tells you what is mapped temporally and what is mapped spatially, and using all of this it runs its analysis and can give you graphs like this.
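The toy functions below illustrate the mapping semantics just described — which PE works on which loop indices under a spatial map, and how a temporal map slides over time. The directive names and behavior are a simplified sketch in the spirit of the talk's pragmas, not MAESTRO's exact syntax.

```python
# Toy illustration of the spatial/temporal mapping directives described above.
# The names and semantics here are a simplified assumption, not the tool's
# actual pragma syntax.

def spatial_map(size, offset, num_pes):
    """Different PEs get different chunks of the loop variable:
    PE p covers indices [p*offset, p*offset + size)."""
    return {pe: list(range(pe * offset, pe * offset + size))
            for pe in range(num_pes)}

def temporal_map(size, offset, num_steps):
    """All PEs see the same chunk at a given time; the chunk advances by
    `offset` every time step."""
    return [list(range(t * offset, t * offset + size))
            for t in range(num_steps)]

num_pes = 5
print("S (spatial, size=2, offset=2):", spatial_map(2, 2, num_pes))
print("S (spatial, size=2, offset=1):", spatial_map(2, 1, num_pes))  # sliding window
print("X (temporal, size=1, offset=1):", temporal_map(1, 1, 4))
```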
[00:16:19] It will tell you, for different data flows — these are actually data flows from some of the more recent accelerators out there — things like how many buffers you need, how much interconnect bandwidth you need, and [00:16:37] what throughput you'll get for different layers of the network. You can actually see, for example, that for the NVDLA-style design the bandwidth requirement for a late layer is much higher than for an earlier one. So this can help guide the compiler to say, OK, maybe for this layer this data flow is better and for some other layer some other data flow is better — you can do analyses like that. You can also plug this into an energy analysis tool to get a sense of the energy breakdown, and use it to maybe go redesign your memory hierarchy or your accelerator. So that's the MAESTRO tool. The third thing I'll talk about is the actual microarchitecture for deploying these DNNs — the hardware accelerator you would want to use in your inference engine. [00:17:24] This work is motivated by some of the things I've talked about already: there are a lot of data flows and mapping strategies that accelerators need to support. One reason is simply the DNN topologies — there are a lot of different kinds of neural networks, convolutional networks are different from recurrent networks, and DNNs today have different kinds of substructures, so GoogLeNet has inception modules, for example — and all of these change the data movement patterns within your accelerator. Your networks may also be very irregular: these are nice, dense networks, but [00:18:01] pruning is a very common step in training today that can make a network irregular, and the networks we generate using the evolutionary algorithms in GeneSys are very irregular because neurons and edges are getting added arbitrarily. And in fact, if you use a tool like MAESTRO, it will tell you very clever ways of reordering loops and so on, which again leads to a lot of different kinds of data flows in your DNN accelerator. While all of this is great — all of these have different performance and energy benefits — the challenge is, how do you run them on your hardware? You fixed the hardware at design time, so how do you execute all of these on it? And the trend for supporting multiple data flows today is actually not very promising: every new data flow essentially results in a new proposal for a new accelerator microarchitecture. If you just follow the papers that have been coming out, what's happened is that every time an ML practitioner or a compiler writer comes up with some new way of designing the network or a new mapping strategy, there is a suite of hardware accelerator papers trying to run it efficiently — which is of course not a very scalable approach, because I would want to design my chip once, deploy it on my self-driving car, and not have to go replace it every two months. So the question we asked is: can we have one architectural solution that can handle arbitrary data movement and data flow patterns and still provide close to 100 percent utilization of your hardware? [00:19:27] So what we did is we said, OK, let's take a step back.
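As a small sketch of how a compiler might use such per-layer estimates, the snippet below picks a data flow per layer under a bandwidth budget. The cost numbers, layer names, and data-flow labels are made up purely for illustration — they are not MAESTRO output.

```python
# Sketch of using per-layer cost estimates to pick a data flow per layer.
# All numbers and names below are hypothetical, for illustration only.

cost_model = {
    # (layer, dataflow): (required DRAM bandwidth in GB/s, runtime in ms)
    ("conv1", "weight-stationary"): (4.0, 1.2),
    ("conv1", "output-stationary"): (2.5, 1.5),
    ("conv5", "weight-stationary"): (9.0, 0.8),
    ("conv5", "output-stationary"): (3.0, 1.1),
}

def pick_dataflow(layer, bandwidth_budget_gbs):
    # Keep only data flows whose bandwidth need fits the budget,
    # then choose the one with the lowest runtime.
    feasible = {df: rt for (l, df), (bw, rt) in cost_model.items()
                if l == layer and bw <= bandwidth_budget_gbs}
    return min(feasible, key=feasible.get) if feasible else None

for layer in ("conv1", "conv5"):
    print(layer, "->", pick_dataflow(layer, bandwidth_budget_gbs=5.0))
```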
This is what a typical CNN looks like — this is the image-processing view, where you have input images, filters, and outputs. Let's go to the neural-network view of this, where you have neurons and synapses; a slice of the outputs here essentially corresponds to the neurons here. If you really look at what's happening in a neuron — as I presented at the beginning of my talk — the computation is nothing but a weighted sum: [00:20:00] you have independent multiplications and additions, and you basically get the output. So our key insight was: whatever the data flow pattern, whatever the neural network topology, whatever the mapping strategy, whether it's sparse or dense, ultimately this can be viewed, at a very fine grain, as simply neurons of different sizes. As an example, if you have a sparse network where this edge suddenly went away because its weight became zero, this neuron would have two inputs while this neuron would have three inputs — so suddenly you have a two-input neuron and a three-input neuron that need to be simultaneously mapped, and the reason current designs are inefficient is that this two-input neuron cannot be handled efficiently. If we view it in this very fine-grained manner, we can then start rethinking how we should design the system. [00:20:45] Suppose I give you an abstraction like this: in your hardware you have a bunch of multipliers and a bunch of adders, and you can create neurons — virtual neurons, if you will — on the fly. I feed in inputs and weights, and I get outputs out. Just like a malloc call, [00:21:03] I ask for the compute units I need: I say I want three multipliers and two adders, I get these three multipliers and two adders, and that gives me a three-input neuron. Then I say I want two multipliers and one adder; I get a two-input neuron here, and I get another one over here. So with eight multipliers I'm actually able to map three neurons completely, without splitting or partitioning them or leaving units idle, and I can morph this depending on whatever I want to map. That's the vision, that's the abstraction we want to give. These are just virtual neurons — temporary groupings depending on what you're mapping — but the question is how you provide this abstraction, or illusion, because you have to commit your resources in hardware at design time. At one extreme you could just have interconnects everywhere — every multiplier can connect to every adder — but that's not going to scale when you have thousands of these multipliers and adders, which is roughly the scale you can fit inside a chip. So our idea was to design an interconnect that is tuned for this application and have it be reconfigurable at compile time or run time. What we observed is that, whatever the accelerator, there are broadly three kinds of traffic patterns that deep learning applications give you. One is distribution: you have to take inputs and weights and distribute them to your PEs, or neurons.
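Here is a software analogy for the "virtual neuron" abstraction: grab as many multipliers (and the adders to reduce them) as each neuron's fan-in requires, like a malloc over compute units. This is only a behavioral model of the idea, under assumed simplifications, not the hardware design itself.

```python
# Software analogy for the "virtual neuron" abstraction described above.
# Purely a behavioral model of the idea, not the actual hardware.

def map_virtual_neurons(fan_ins, num_multipliers):
    """Assign contiguous multiplier IDs to each neuron; returns None if the
    requested neurons don't fit in this pass."""
    groups, next_free = [], 0
    for fan_in in fan_ins:
        if next_free + fan_in > num_multipliers:
            return None                      # would need another pass
        groups.append(list(range(next_free, next_free + fan_in)))
        next_free += fan_in
    return groups

def run_virtual_neurons(groups, inputs, weights):
    # Each group does its multiplications, then a local reduction (the adders)
    # produces that neuron's output.
    return [sum(inputs[m] * weights[m] for m in grp) for grp in groups]

# A dense 3-input neuron plus two "sparse" 2-input neurons share 8 multipliers.
groups  = map_virtual_neurons(fan_ins=[3, 2, 2], num_multipliers=8)
inputs  = [1.0, 2.0, 0.5, 3.0, 1.5, 2.5, 0.0, 0.0]
weights = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.0, 0.0]
print(groups, run_virtual_neurons(groups, inputs, weights))
```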
[00:22:25] Second, you have to reduce the outputs — remember I mentioned you have to do a weighted sum, so this is basically a reduction step. And third, you have some local forwarding, where PEs exchange data among themselves; as I mentioned, this is why accelerators are more efficient, because they do local communication. As long as your hardware can support these patterns and has the flexibility to realize them, you can get that abstraction, and this is what we actually tried to do. This is our design, called MAERI. I won't go into a lot of detail, but the key idea is that there's a bunch of switches called microswitches connected to each of the multipliers and adders, and they support the three patterns I talked about: there's a distribution network, where you simply take inputs and weights and distribute them over a tree to the multipliers; there's a linear forwarding network; and there is a reduction network with high bandwidth, which also has some additional links for some provable non-blocking properties I won't go into. Using all of this, you can now start realizing that abstraction. The switches are configurable — they can be configured by a compiler, by a runtime system, or even by hardware — and you can create neurons of arbitrary sizes. As an example, if I wanted to map a dense CNN, I could create neurons in this manner — this is how the weights and inputs flow — and I get full throughput. If I wanted to run a sparse DNN, suddenly some neurons are larger and some are smaller, and I may have to map things like LSTMs and fully-connected layers, where the neurons are really huge and the entire array does just one reduction. All of these are feasible within this one substrate, just by configuring the virtual neurons differently. When we ran MAERI against some of the state-of-the-art accelerators out there, we saw roughly 42 to 65 percent performance improvements, and that's just on dense workloads; with sparse workloads the improvements are even higher, and we are working on some of those designs. [00:24:19] OK, so to conclude — I know I've thrown a lot of information at you, but just to put it in perspective: AI will be pervasive, this is something we all agree on, and as I mentioned there are three challenges to getting it to be pervasive. You really need to design the right neural network, you need to be able to map it onto your available hardware, and you need hardware that's very efficient at running these networks. These are the three directions my group has been looking at. For the first direction, we built a system called GeneSys, which is a hardware-software design of neuroevolutionary algorithms for continuous learning, where the neural networks keep learning and evolving for the problem at hand. We have a tool called MAESTRO, which is an analytical model that lets you determine the best mapping strategy for a given neural network on your hardware platform. And finally, MAERI is a hardware platform with a reconfigurable interconnect that lets you map arbitrary data flows [00:25:18] with close to 100 percent efficiency. So that's the end of my talk, and if there are any questions I'd be happy to take them.

>> [Audience question]

>> Yes, so that's a good question. I think there's probably overlap in some parts.
As I said, this is a very active area of research — how do you design very efficient accelerators? I think one of the key things that is different in my group's approach is that a lot of these problems are being viewed from a communication perspective rather than a compute perspective. My own background is more in interconnection and communication networks, so rather than looking at it from the compute side, we essentially try to boil the problem down to: what are the communication patterns in these systems, and how do you design the system so that each of those communication patterns can be optimized? That's one different way of looking at it, at least on that end of things. And the other work, I think, [00:26:27] is pretty unique. There are a few teams — Oak Ridge has a team that's been looking at evolutionary algorithms, Google has a team — but this is a domain that's very much emerging. Currently, deep learning is still operating in the regime where you do have the data and you have experts really trying to design the network; this is trying to look into the future, where if I really wanted to do things autonomously, how do I do it? Of course there are a lot of other problems to be solved to get to autonomous systems, but this is trying to look at one possible solution.

[00:27:11] >> [Audience question]

>> Yes, great question. So yes, that's the hope — it's not completely integrated right now. In some sense MAERI is the more detailed, more accurate one, and MAESTRO can get feedback from it. What we have actually been trying to do is: MAESTRO is very quick — it runs in, let's say, less than a second — and gives you some estimates; you take that data flow and then run it on MAERI, which is the actual hardware platform — it's RTL you can download — and you get real estimates, and those can then be fed back into the system. We're actually doing a tutorial on both of these tools at HPCA, and the hope is that by then the flow will be integrated, so you can do the analysis, [00:28:00] then run a simulation, get an FPGA result, and feed that back.

>> [Audience question]

>> Good question. I think there are two ways of seeing it. Computationally, what we saw is that this is actually much more efficient than the traditional training algorithms, but of course the tradeoff is accuracy. If you really have a lot of data and you really want to converge to a solution, just doing backpropagation with stochastic gradient descent will give you better accuracy and may get you to the solution faster, though each step of the algorithm will involve a lot more computation and a lot more memory. These algorithms are computationally simpler because it's primarily inference — the evolution step is just mutations, there are no gradients, and there's very little storage — but the tradeoff is how soon you can get to the end. That's actually something we're trying to explore: if you have a lot of agents, like a million nodes, trying to do this, versus thousands of GPUs trying to do traditional training, and this is more efficient in terms of energy and compute, I might still be able to do better overall. That's an open question that people have started looking at, and again, that's where I feel having the systems perspective is nice, because what if I give you custom hardware just to do this? Maybe you can actually flip the curve and get a system that is
[00:29:26] computationally more efficient and can actually match the accuracy of a backpropagation algorithm — that's what we're trying to do.

>> [Audience question]

>> Yes, it's a great question. I think it's probably the same answer as why deep neural networks are coming back now — very similar observations. The same things that held those networks back apply here: you need a lot of compute. [00:29:57] Of course, back then it was compute, data, and algorithms, all three together. I think these are coming back because people again realize that neural networks can give you good solutions, so if there is a way of using these algorithms to generate better neural networks, maybe that can also give you a better solution. And the same benefits apply: now I can deploy them on hundreds of nodes — compute is much cheaper — while earlier it was much more expensive to do this. Most of the algorithms we're looking at come from papers from the eighties and nineties, that's definitely true.

[00:30:41] >> [Audience question]

>> So that's a good question. It's actually a function of how much overhead you want to pay for the reconfiguration. All I showed you was just the datapath with all of these trees; there are, let's say, N of these multipliers, and there are switches throughout. If you really wanted to reconfigure things cycle by cycle, you would have a huge control plane connecting to all of the switches, which is going to be more expensive from an area and power perspective but can give you very fast reconfiguration — and maybe that's fine, because overall you still do better than a static configuration. What we realize in practice, though, is that once you know your neural network and you have to map it, the virtual neuron sizes are fixed, at least for that entire layer's run, so you only need to reconfigure at that granularity. That means you don't need a high-bandwidth control plane to configure it; maybe you can just have a few bits that, like a scan chain, go to all of the switches. So to me it seems like that's more a question of the rate at which you want to reconfigure. The key cost, irrespective of the reconfiguration time, is that every compute unit has a tiny switch attached to it, which is, let's say, a 10 percent overhead — that is a cost that's added just because it's reconfigurable — but the rate of reconfiguration, and the cost to reconfigure, is something you can decide based on your requirements. As a very quick example, I talked about sparsity, where the neurons have different sizes: every time I run a new input, maybe there's more sparsity there, so if I really want to reconfigure every cycle — because now suddenly there's more or less sparsity — then you need things to be reconfigurable at a faster rate, or you pay the cost of some inefficiency or underutilization because you've configured at compile time. [00:32:48] Thank you.