[Host introduction, largely unintelligible in the recording: the speaker is introduced as the director of a center, with mention of his awards and of his energy in science.] Thank you, and I thank all of you for braving the snow, the ice, the rain, and God knows what. So what I will do today is give you kind of an overview of what's been going on in my research group. Over here you have a genome. It could be yours, or it could be a microorganism's. My view of the world is very protein-centric, so we're going to look at the coding regions for proteins, which are probably only a tiny fraction of the biologically important molecules, but this is where we started. What one is interested in is: what are the proteins doing? If you go to the average database, and it doesn't really make a difference whether it's the database for us or a database for a primitive organism like Mycoplasma genitalium, you'll see annotations for about fifty percent of the functions of the proteins. That doesn't mean these are experimental annotations. Most of the annotations you see are functional inference by homology: A is evolutionarily related to B, B is related to C, C is related to D, and somehow, by a link which has unfortunately in many cases been lost to history, you will see an annotation of the type "protein involved in apoptosis." But you really don't have any clue about the actual quality of the annotation, whether it is an experimental annotation or a computational inference. And moreover, in the average proteome, the function of about half the proteins is simply unknown.
And so one of the things one might want to be able to do is infer, at high levels of precision, perhaps at the expense of coverage, the function of the remaining fifty percent of the proteins. Now, function has many facets. It could be biochemical function. [Brief aside, unclear in the recording.] Your protein may bind a molecule. It may be involved in some metabolic or signaling process. You may go to the physiological function, and ultimately to the phenotypical function, where this class of stuff is expressed at some gross level. So function is multifaceted. What we focus on are the molecular aspects of function. In particular, the key questions one would like to be able to answer are: what is the biochemical function? Is it an enzyme? If so, what is the substrate specificity? What is the nature of the ligands that bind? And what is the nature of the cross-reactivity, if any? Now, ideally, if you could do this at some reasonable level of accuracy, this might help you in the process of drug discovery, because the reality is that most drugs which actually make it, at the cost of a billion dollars, through phase one clinical trials (that means they haven't killed the pet rat and, at least in small populations, they don't kill you) unfortunately don't work. Or you'll see things like off-target interactions. Minoxidil is an example: minoxidil was originally a blood pressure medication, and it grows hair, at least in some people. So there are a lot of off-target interactions, and ideally what you'd like to be able to do is look at the entire proteome and be able to predict:
Which proteins interfere if I give you a small molecule with the desired specificity? Namely, I want to inhibit some biochemical process, but in the real world you might be in the situation where I have a molecule and that molecule may interact with a whole bunch of other molecules. So what I'm going to do is tell you about a computational approach which allows us to use predicted as well as experimental structures to predict the binding location, the likelihood that small molecules bind, and cross-reactivity, and I want to apply it to a pharmacologically important class of molecules, the kinases. These are important regulatory proteins. They represent on the order of about a third of all commercial drug targets. Now you might say, why bother doing structure prediction, other than that it's a very interesting question: you want to predict the biologically active form of the protein given the amino acid sequence. The reality is that in an average proteome we know the experimental structures of less than half of the proteins. So most of what we will have is not actually an experimental structure of protein number fifty, but maybe a model, built from a protein which is evolutionarily related. On the other hand, as the proteins get evolutionarily more diverged from the protein whose structure is solved, the quality of the prediction unfortunately deteriorates. So these are less than perfect models. On the other hand, if one could use low-resolution structures, models with a backbone root-mean-square deviation (RMSD) from native on the order of about six angstroms and binding pockets on the order of a couple of angstroms, this would enable you to cover about seventy-five percent of the entire human proteome. So even though, in the case of human, you have experimental structures or very highly accurate homology models for no more than about half, this would enable you to cover a significant fraction of the human proteome. So the key question is this.
Can you use low-resolution structures as opposed to experimental structures? If you can, what can you say? Can you predict binding affinity? What kinds of things can you do with the structure? What you should be aware of is that while protein structure prediction has occupied a lot of activity and a lot of intellectual interest, because it's a very simple problem to state and not such an easy one to solve, an operational question is: what is the value added of having a structure? If I hand it to you, simply giving you the structure does not in general tell you what the protein does. You have to do something else. It's infuriating, but it's also true. And then the question is what type of structure: if we are limited to experimental structures, it's going to be a while before we can do this. So that's basically the background, and what I'm going to tell you about is a computational approach, developed by [name unclear], called FINDSITE, which enables us to use predicted as well as experimental structures and to infer functional aspects by looking at evolutionarily distant, not close, relationships. [Audience question, inaudible.] About twenty percent. I'll show you in the case of the kinases, where I know that number exactly: there are five hundred sixteen kinases in the human proteome and structures of about one hundred twenty. [Audience question, inaudible.] That is an indeterminate answer. Actually the hardest thing is to express the protein and get it to purify and fold. That's the biggest attrition rate. The success rates of the structural genomics projects are better but not spectacular. They lose a lot on preparing material, and there's also an issue that a number of the proteins actually don't crystallize, and the reason they don't crystallize is that they may need a cofactor, or they may be involved in an association-induced conformational transition and so lack a unique structure in isolation, or they are too mobile to give a crystal structure.
So the yield is not that high. For example, the most frustrating case is the GPCRs: there are about a thousand human GPCRs and we know the structure for [number unclear], because they're membrane proteins, so those are the worst case to crystallize. And also, of course, you were at about one hundred thousand dollars per structure, and now the cost has probably dropped to about fifty thousand. So the point is that even if you could do it, it would be very expensive, and in practice you can't do a lot of it, for technical reasons having to do with getting purified samples of sufficient quality. If you can get a protein that actually folds and is not disordered, the odds are seventy-five percent that you'll have a crystal structure that diffracts at this point. But only about ten percent of all proteins that they try ever make it to that form. They typically fail at the expression level or at getting a biologically active form; the refolding and reconstitution is the rate-determining step at this point in time. When people try it, the problem is not the crystallizing; it's that they don't get a functionally active form, or they don't manage to get it expressed. Getting the starting material turns out to be the hardest thing. If you look at the attrition rates of the structural genomics projects, getting purified protein that's folded is the rate-determining step, and that success rate is ten percent. So only one out of every ten they try works, and they basically try a whole bunch of homologs and a whole bunch of mutations, and whichever one folds, it's winner take all. [Audience question, apparently whether the remaining eighty percent is homology models or not solved.] Yes. With close proteins it's forty percent of the proteome: if you have about thirty-five percent identity, you can build models for about forty percent. Using state-of-the-art structure prediction methods, it's seventy-five percent.
So for three quarters of the non-membrane proteins you can build good models for the majority of domains, and moreover you'll know when they're good and when they're bad. So it's significant coverage. It's not one hundred percent; unfortunately there are some spectacular cases where you fail miserably, and that's mostly because they don't have close homologs: they don't have anything even remotely related crystallized in the PDB. So what's actually interesting is if you look at a crystal structure. This is a cartoon for purposes of this discussion, and this is a collection of proteases. What's very interesting is that these are evolutionarily really quite distant; they only share about twenty percent sequence identity. One of the key features of evolution is that it keeps reusing the same binding site over and over and over again. This was originally pointed out by Patsy Babbitt and John Gerlt for the TIM family, which is a very old family (the triosephosphate isomerases and relatives). It turns out to be actually very, very general. There are features of the binding site, and of the chemical nature of the ligands that bind, that are very strongly conserved. There are also things that vary, and so you can actually use this to do functional inference, and if you do it right you can use low-resolution predicted structures. That's what I want to tell you about, and that's the idea behind FINDSITE. The idea is really very simple. Look at the binding site of a protein; this is glutathione transferase. One of the questions is: what is the plasticity of a binding site across experimental structures that bind exactly the same ligand with exactly the same function? If you believe the mythology, they should have an RMSD of zero: there's one structure and that's it. That's not even close to the truth. Take the average RMSD of proteins that bind the same ligand.
If you look at the backbone, it is about 2.2 angstroms, and [unclear] it's an angstrom and a half. So the binding pockets themselves are not rigid and structurally unique; they're plastic. This is an important thing, because you can take advantage of it and use the structural deformations and plasticity in the low-resolution structures, which reflect the physical plasticity of real structures. You can see these kinds of subtle variations. So if you say that you have to have a structure which is one unique answer, that's not the nature of nature. Nature is plastic: there's a certain plasticity, and as long as you're around that threshold of plasticity in your models, that's all you really need. So the idea is really very simple. This is an engine; it's "look ma, no hands." We take a sequence. It could come with an experimental structure or a model structure. We thread. Threading tries to identify evolutionarily distant but structurally related proteins. We get the templates and superimpose them on the target structure, which may itself be predicted from some solved structure. And then we cluster the binding pockets and look at the properties of the ligands that bind. So one of the things you might want to ask is which residues bind, and what's conserved. What are the B-factors of the residues; are they conserved? For the ligands, are there features of the ligands that are conserved, or is everything variable? One view is that it's all over the map. Another view is that there are actually substructures that are conserved throughout the course of evolution, plus a variable region which imparts the specificity, which we will actually use. Another thing you may do is a very dumb algorithm for enzyme function inference: count the fraction of evolutionarily distant but related templates that have a given GO molecular function, and if it's more than fifty percent, just transfer the function.
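A minimal sketch of the "dumb" transfer rule just described, assuming each threading template comes with a set of GO molecular function terms. The function name, the data layout, and the example GO identifiers are illustrative assumptions, not the group's actual code.

```python
from collections import Counter


def transfer_functions(template_annotations, threshold=0.5):
    """Return GO terms carried by more than `threshold` of the templates.

    template_annotations: one set of GO term strings per template
    (hypothetical layout chosen for this sketch).
    """
    n_templates = len(template_annotations)
    if n_templates == 0:
        return {}
    # Count each term at most once per template.
    counts = Counter(
        term for terms in template_annotations for term in set(terms)
    )
    # Keep terms whose template fraction strictly exceeds the threshold.
    return {
        term: count / n_templates
        for term, count in counts.items()
        if count / n_templates > threshold
    }


# Four hypothetical templates; the identifiers are only examples.
templates = [
    {"GO:0016301", "GO:0005524"},
    {"GO:0016301"},
    {"GO:0016301", "GO:0005524"},
    {"GO:0003677"},
]
transferred = transfer_functions(templates)
print(transferred)
```

Here the first term is present in three of four templates and is transferred; the second sits at exactly fifty percent and is not, because the rule as stated requires more than fifty percent.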
That's a dumb algorithm. You could make it more sophisticated, and obviously we've optimized it, but it turns out that the dumb one works quite nicely. So the first question you want to answer is how well the method works on high-resolution experimental structures. If I have an algorithm that doesn't work on a structure without random errors, it's certainly not going to work on one with random errors. And then the question is: since I cannot predict the structures of the entire human proteome at experimental resolution, but I can predict a lot of low-resolution structures, can we use these low-resolution models for drug screening and functional inference? If you can, this basically opens up a whole new field. You don't want to do this on the statistics of one; you want to do it on the statistics of many. So we took a representative benchmark set of nine hundred one nonhomologous proteins between fifty and four hundred residues. We threw away evolutionarily close proteins, because copying between two proteins that are evolutionarily closely related is trivial and not very interesting. Now, another fact: there are many sequence-space methods that work quite well as the proteins get closer and closer. When target and template are close, they work extremely well; as they start to diverge, they tend to work less and less. If you're interested in functional inference, in inferring the function of two proteins that are ninety-nine percent identical, the odds are they have the same function; you don't need the structure. Structure can add value in the limit of low sequence identity to proteins of known function, not high. At higher levels of sequence identity you can just copy the function with appropriate thresholds that are family specific. We built algorithms that do this, which are actually used by [group unclear in the recording], once they've figured out the genes, to try to figure out which ones are the enzymes.
We want to be in the regime where the sequence identity is low. That's the value added of structure. So we want to throw away all closely related proteins, and you want to have a reasonable number of ligands, and the question is how well you do at finding the binding site when I know nothing; I just want to find which site is important. [Sentence unclear in the recording.] So here I have the results. These are the crystal structures. This is FINDSITE, which is our algorithm. This is LIGSITE. LIGSITE is a binding pocket detection algorithm: it looks at hydrophobicity, at whether there is an appropriate cavity, at the shape, and so on, to find one of these things. Now, this is great. This isn't bad; it's a little bit worse. A reasonable threshold, as you see, is about four angstroms. What happens when I go to predicted models? Predicted models have an average RMSD of about five angstroms, so they're not perfect. Again we throw away closely related proteins. You see the performance drops dramatically, to around forty percent, for LIGSITE, because the pocket shape is distorted. On the other hand, it only drops from seventy-one percent to sixty-seven percent when I use a predicted model with FINDSITE. By looking at consensus binding sites you can look at the most populated one, and generally the most populated one is the ancestral one. And then you also see functional diversity; there's a lot of richness here, about how evolution works, that I don't have time to tell you about. Amongst the other things you could do, you could just look at the ligands and say: let me look at consensus ligands, let me throw away the native ones, and ask what the ranking is in a representative library (this is a library of drug-like compounds), in the top one percent, and what the enrichment factor is.
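As a rough illustration of picking the most populated consensus site mentioned above, the sketch below greedily clusters the centers of template-bound ligands (already superimposed onto the target frame) and returns the centroid of the largest cluster. The greedy seeding scheme, the 4 angstrom cutoff, and the function name are assumptions made for this example; the talk does not describe the actual clustering procedure.

```python
import math


def consensus_site(ligand_centers, cutoff=4.0):
    """Greedy clustering of ligand centers; returns the top cluster.

    ligand_centers: (x, y, z) tuples in angstroms, one per template ligand.
    Returns (centroid of the largest cluster, fraction of ligands in it).
    """
    if not ligand_centers:
        raise ValueError("need at least one ligand center")
    clusters = []
    for center in ligand_centers:
        for cluster in clusters:
            # Compare against the cluster seed (its first member).
            if math.dist(center, cluster[0]) <= cutoff:
                cluster.append(center)
                break
        else:
            clusters.append([center])
    best = max(clusters, key=len)
    centroid = tuple(sum(axis) / len(best) for axis in zip(*best))
    return centroid, len(best) / len(ligand_centers)


centers = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 1.0, 0.0), (20.0, 20.0, 20.0)]
site, support = consensus_site(centers)
print(site, support)
```

The returned fraction is a crude stand-in for how much template support a predicted site has, which is the kind of quantity the speaker later ties to prediction confidence.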
In the top one percent, an enrichment factor of one hundred means you got everyone, and an enrichment factor of one means you should kill yourself; you're basically random. And you see that there's significant enrichment. This is kind of a baby test for ligand native-likeness: you simply pool the ligands, throw away the ones you're looking for (obviously it's jackknifing), and ask whether you can recover them. There is a consensus, and the question is what the consensus of the ligands is. The point is that even though these are evolutionarily distant (the average target-template pairs have a sequence identity of twenty percent, so they're not close, they're actually quite far), they have the same global fold and they share a common binding site. You can also look at binding site accuracy. I just want to call attention to the fact that even though the global RMSD is lurking somewhere up here, even when you have a model whose global RMSD is devastating, the binding sites tend to be very strongly conserved. There's an RMSD of about three angstroms, and the reason is how threading works: the methods all have a very strong evolutionary term. For those of you who are aficionados, they have a profile-profile term; they look at the correlation in the sequence profiles of the target and the template. The last thing that goes in the signal is the binding site. And so the binding site is actually quite well modeled even when the rest of the structure is garbage. How it works is that it looks for consensus contacts and reassembles and recapitulates them, so the binding site is fine and the rest of the structure is garbage. So you can actually do this on pretty crummy structures.
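For readers who want the enrichment factor pinned down, here is a minimal sketch using the usual definition: the hit rate in the top fraction of the ranked library divided by the hit rate in the whole library. The function name and the list-of-booleans input are assumptions for illustration.

```python
def enrichment_factor(ranked_is_active, top_fraction=0.01):
    """Enrichment factor of a ranked screening library.

    ranked_is_active: booleans ordered best-ranked first, True for a
    known active. A value of 1 corresponds to random ranking.
    """
    n_total = len(ranked_is_active)
    n_actives = sum(ranked_is_active)
    if n_total == 0 or n_actives == 0:
        raise ValueError("need a non-empty library with at least one active")
    n_top = max(1, round(n_total * top_fraction))
    hits_in_top = sum(ranked_is_active[:n_top])
    return (hits_in_top / n_top) / (n_actives / n_total)


# 1000 compounds, 10 actives, all ranked first: the best case at 1%.
perfect = [True] * 10 + [False] * 990
# 10000 compounds with one active in every block of 100: random-like.
uniform = [i % 100 == 0 for i in range(10000)]
ef_perfect = enrichment_factor(perfect)
ef_uniform = enrichment_factor(uniform)
print(ef_perfect, ef_uniform)
```

With a one percent cutoff the factor tops out at one hundred, which is the "you got everyone" case, and it is close to one when actives are spread evenly through the ranking.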
Now, one of the questions you might have is how confident I am in my binding site model. I'd like to be able to say to an experimentalist, before they waste any money and time and a postdoc or a graduate student: this is a confident prediction, even though I don't know the answer, and this one is not. And you can do this. If you have a lot of templates that find the binding site, these we call easy. They have small coverage, but they have a ninety percent hit rate. If you have a significant number, somewhere on the order of twenty-five to one hundred twenty-five, it's got a seventy-two percent hit rate, so fifty-six percent of the targets would be confidently predicted. And then there are the hard ones: since you're looking for a consensus location of the binding site and consensus properties, if there is no consensus, you have no clue. So with a lot of templates in a diverse set, it gets better. Now, amongst the things you could do, you could simply look at the templates and count the fraction of the templates that are evolutionarily distant and have a given GO molecular function. I know GO isn't perfect, but it's as good as anything for a kind of very preliminary assessment. Across the entire set, our average Matthews correlation coefficient is 0.64, with a precision of 0.76 and a sensitivity of 0.54. Let me show you some useless predictions: "it's an enzyme." Well, it's nice to know if it's an enzyme or not, but that doesn't really tell me much about what kind of enzyme it is. On the other hand, I might have a peroxidase activity, or [activity name unclear]; it can be quite specific. For those of you who know something about this, if it's an enzyme, this gives you, with quite high accuracy and precision, the general class of chemical reaction, the first three EC digits. It doesn't give you the substrate specificity.
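The three statistics quoted above (Matthews correlation coefficient, precision, sensitivity) all follow from a confusion matrix of predicted versus true annotations. The sketch below uses the standard formulas; the counts are invented for illustration and are not the benchmark's data.

```python
import math


def annotation_metrics(tp, fp, fn, tn):
    """Precision, sensitivity and Matthews correlation from raw counts."""
    precision = tp / (tp + fp)
    sensitivity = tp / (tp + fn)
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # MCC is conventionally set to 0 when any marginal is empty.
    mcc = (tp * tn - fp * fn) / denominator if denominator else 0.0
    return precision, sensitivity, mcc


# Made-up counts: 50 correct transfers, 10 spurious, 50 missed.
precision, sensitivity, mcc = annotation_metrics(tp=50, fp=10, fn=50, tn=890)
print(round(precision, 3), round(sensitivity, 3), round(mcc, 3))
```

High precision with lower sensitivity, as in the reported numbers, is the profile one would expect from a rule that only transfers a function when most templates agree.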
That's because, remember, what we're looking for across evolution is consensus. We're looking for the thing which is conserved, not what varies. And that's basically the first three digits. So this will give you three EC digits with reasonable accuracy and precision. So you look at this and say: OK, you can find binding pockets in low-resolution structures even though you don't know anything about the protein other than its sequence, which we've basically folded. Why is it working? Is it teaching us anything? Is this magic? What is it about ligands and about the binding pockets that makes this thing work? There's a very simple idea. [Audience question, inaudible.] Yes. The ligands were co-crystallized with the holo proteins in other structures, not in the target structure of interest. I know what they are in sets of proteins whose pairwise sequence identity to the target protein is twenty percent. [Audience question, inaudible.] Sometimes very well, sometimes not. It depends, and I'll tell you why, if you just give me a minute. It'll become apparent, because what's conserved across this evolutionary set is a substructure, which we call an anchor, which sits in the binding pocket. It could be one hundred percent of the ligand, in which case I'm done. It could be five percent of the ligand. So there could be eight [atoms] in an eighty- or ninety- or one-hundred-fifty-heavy-atom ligand that are absolutely conserved. It could be the tail of the dog that's conserved, not the dog itself. Its geometry is absolutely conserved, the other regions are variable, and it has a lot of conserved features. So I can't tell you a priori. After the fact, by looking at the degree of conservation, you could say, for example, that it's binding ADP. Obviously, if they were all binding ATP, I'm done: it's an ATP-binding protein. In other cases the anchor is just a tiny fraction. I'll show you the variable and the conserved regions for the GST-type families in a minute.
So in general I cannot use it trivially for functional inference. It's a good question. It depends upon the extent of the anchor and of this variable part. So there are two features, of the ligand and of the protein. There is an anchor, which is the conserved ligand substructure, whose chemistry is conserved. For those of you who know anything about drug discovery, typically the medicinal chemists have discovered this, because this is the region they don't vary, which they discover by trial and error. We can predict that. And then there is the variable region, which is the region that does vary, which is responsible for the substrate specificity. For the anchor there is also a set of residues that bind the anchor, and one of the questions is: are they on average more conserved or less conserved, are they more mobile or less mobile? And then there are the variable regions. So this is the idea. Is it true? Can we find it? How conserved are the variable and non-variable regions? Is the geometry conserved? Because if it's conserved, and you give me two proteins that are fifteen percent identical, all I've got to do is copy the anchor position. It's a dumb ligand docking algorithm. And then I have to minimize the tail. This is actually very important, because the truth is that existing all-atom docking functions are basically nonspecific. They're dominated by the van der Waals terms, and they don't care: they depend only on molecular weight, and they're invariant. If I take the ligand and chemically permute it, just without busting the valence of the bonds, you get exactly the same binding affinity; it fits. So there's a real problem there. And so now what we're going to do is just generalize the same idea as FINDSITE: we're going to find these ligand templates, and we're going to look at the substructures in the consensus binding modes.
So does it exist? What fraction of the structure is it: two percent, one percent, one hundred percent? If it were always one hundred percent, that would be great, because all I've got to do is copy it. Out of that set of nine hundred one, there were seven hundred eleven structures we could say something about, and there's also a set of low-resolution models [remark unclear]. So let me show you. This is an interesting matching problem: find the chemically conserved substructure when I have no clue. It's a structural and chemical matching problem. It's not solved exactly, but we have some heuristics that we stole from some other people, a bunch of them. So this is the average molecule size, as the number of functional groups. The other thing is that we want to describe the ligands not at the level of heavy atoms but at the level of little functional groups: a carbonyl group, an amine group, a methylene group, a hydroxyl group. The unit is not an atom; it is a functional group. So this is in numbers of functional groups. This is the average anchor size, in number of groups. So this is one hundred percent; this is the uninteresting case, completely conserved. This is fifty percent. A lot of them, you see, are between fifty and seventy-five percent conserved, so half of the molecule is this conserved chunk and then there's a variable region. And then you see some guys that are less than twenty-five percent, where there is literally a little tail which is the absolutely conserved chunk. And by the way, the geometry is conserved in space too. So let me make this a little bit more concrete. We should turn these lights down, because there's no way to see this thing otherwise. [Lights turned off.] This is so much better. The white is the anchor region; this is for glutathione transferase. The blue is the variable region, and we're also going to look at regions of low and high sequence entropy for the residues that touch the ligand.
So what you're seeing is this ligand. It's very faint, so it's hard for you to see, and I apologize for this; it's on my screen but for whatever reason it's not showing up. The backbone of the anchor part of the ligand is very strongly conserved, and this variable region is moving all over the map: some of it is over here and the other conformations are over there, so it's kind of rotating around in space. So let me show you an anchor and some variable regions in the glutathione transferase family. This is the location of the variable region, and you see some of them are bigger and some of them are littler. In this case it could be roughly ninety and ten percent, and here you have a nice little dangling tail. They're all over the map. So you see a whole bunch of these things, but what's very interesting is this conserved region: its geometry sits in one spot in the structure, absolutely conserved. What's actually even more interesting, and I've blown this up: the red is low sequence entropy. Not surprisingly, the amino acids that touch the conserved anchor are much more strongly conserved. The variable region, not surprisingly, is variable: variable ligands, variable residue composition of the protein. But what's also interesting, and we didn't expect this: these are the experimental B-factors. Red is low B-factor, green is high B-factor. Protein mobility in the neighborhood of the anchor is greatly reduced relative to the variable region. The variable region is on average much more flexible. It's like the protein is holding on very tightly. And this is true not only for GST; this is true across the board for all seven hundred eleven families. It's a general trend which is observed. So how do you dock? I can show it to you schematically, but the idea is very simple. I have an anchor pose. I have to figure out where the ligand matches the anchor substructure.
I dock it by harmonic restraints and then I minimize the rest of it. So that's basically what I'm about to show you. It's a very quick and dirty algorithm which does as well as anything. Basically all that this says is that you just minimize it and go on to the next one. And so let me show you what happens when you do this. This was the initial pose, and this is the pose after minimization; you just hit it with a hammer. In many cases the models are already in roughly the right place, and if the variable region is not too big, then simply by minimizing, not even relaxing the structure, you do better. So let me show you the cases on the crystal structures. There are a couple of cases I want to deal with, in terms of heavy-atom RMSD from the crystal structure. This is FINDSITE; this is the average RMSD with minimization on crystal structures. This is AutoDock. So this is the benchmark, and you see that this is not a bad algorithm, when you have a crystal structure, for recovering the RMSD. What happens, not surprisingly, when the coverage is partial? We place them all at the same center of mass, so we give them all the same hint: they're in the neighborhood, it's a randomized conformation, and the center of mass is placed at the predicted center of mass of the binding pocket. So everybody starts from the same initial state, and then they do whatever they do. You see that if we have partial coverage we get worse; in this case AutoDock is a little bit better, and at low coverage AutoDock is going to do comparatively well. And by the way, five angstroms is random. So if there is no information and you're minimizing with respect to a position, you're going to get a random answer. If there is no information, you're not going to get better than random. Let me just show you some cases.
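To make "dock by harmonic restraints" concrete, the sketch below scores how far the matched ligand atoms sit from their anchor positions with a harmonic penalty, and applies the rigid translation that minimizes that penalty (the mean displacement of the matched atoms). It is a deliberately reduced illustration: rotation, ligand torsions, and the later minimization of the variable tail are all ignored, and every name and parameter here is an assumption rather than the speaker's implementation.

```python
import math


def restraint_energy(ligand_xyz, anchor_xyz, pairs, k=1.0):
    """Harmonic penalty 0.5 * k * d^2 summed over matched atom pairs.

    pairs: (ligand_atom_index, anchor_atom_index) tuples.
    """
    return sum(
        0.5 * k * math.dist(ligand_xyz[i], anchor_xyz[j]) ** 2
        for i, j in pairs
    )


def translate_onto_anchor(ligand_xyz, anchor_xyz, pairs):
    """Shift the whole ligand by the mean matched-atom displacement.

    For a pure translation this shift minimizes the harmonic penalty.
    """
    n_pairs = len(pairs)
    shift = tuple(
        sum(anchor_xyz[j][d] - ligand_xyz[i][d] for i, j in pairs) / n_pairs
        for d in range(3)
    )
    return [tuple(c + s for c, s in zip(atom, shift)) for atom in ligand_xyz]


# Toy ligand of three atoms; the first two match a two-atom anchor.
ligand = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (2.0, 0.0, 0.0)]
anchor = [(10.0, 0.0, 0.0), (11.0, 0.0, 0.0)]
matched = [(0, 0), (1, 1)]

energy_before = restraint_energy(ligand, anchor, matched)
placed = translate_onto_anchor(ligand, anchor, matched)
energy_after = restraint_energy(placed, anchor, matched)
print(energy_before, energy_after, placed[2])
```

The unmatched third atom, standing in for the variable tail, is simply carried along by the translation; in the procedure described in the talk it would then be minimized.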
OK, so this is the initial FINDSITE position. We simply minimize, and it goes from 3.7 angstroms to 1.7. AutoDock is about 2.7. LIGIN is about six; it completely gets it wrong. This is a protein-specific version of our potential; I won't talk about that today. Now, how do you know when you have a good ligand binding pose predicted, when you don't know the answer? There are certain statistical measures. One thing is the anchor coverage. This is not the fraction of the anchor that's in the ligand; it is: I have an anchor, and what fraction of the substructure matches the anchor. So if the chemical moiety has better than a ninety percent match to the anchor, what you see is that your average heavy-atom RMSD is going to be about two angstroms. As the coverage drops, the RMSD is going to grow. Another thing you could ask is how well conserved the anchor pose itself is. I have a whole bunch of structures. If this anchor is very variable, docking to the centroid position of the anchor is going to be not bad, but I'm going to have a spread, and a number will do poorly. If the anchor is very tightly clustered in space, with low pairwise RMSD, the resulting docking accuracy is very good. So tight clustering, very good prediction; poor clustering, poor predictions. And then this is the postdiction: this is the accuracy of the predicted pocket, as a distance, and this is the docking accuracy. When you're very far away, you do terribly on the prediction: if I get the wrong binding pocket, I'm getting the wrong answer. When I'm reasonably close, within three angstroms, I can expect the model to be on the order of four angstroms RMSD or better. Now your next response might be, for those of you who are used to the detailed problem, that that's not good enough. For ranking, actually, it's very good.
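The "tight clustering, good prediction" criterion can be expressed as the mean pairwise RMSD among the anchor poses taken from the different templates. The sketch below assumes every pose lists the same anchor atoms in the same order and that all poses are already in the target's coordinate frame; the function names are illustrative.

```python
import math
from itertools import combinations


def rmsd(pose_a, pose_b):
    """Root-mean-square deviation between two equally ordered poses."""
    if len(pose_a) != len(pose_b) or not pose_a:
        raise ValueError("poses must be non-empty and of equal length")
    squared = sum(math.dist(p, q) ** 2 for p, q in zip(pose_a, pose_b))
    return math.sqrt(squared / len(pose_a))


def mean_pairwise_rmsd(poses):
    """Average RMSD over all pairs of anchor poses (lower is tighter)."""
    pairs = list(combinations(poses, 2))
    if not pairs:
        raise ValueError("need at least two poses")
    return sum(rmsd(a, b) for a, b in pairs) / len(pairs)


pose_a = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
pose_b = [(3.0, 0.0, 0.0), (4.0, 0.0, 0.0)]  # pose_a shifted by 3 angstroms
spread = mean_pairwise_rmsd([pose_a, pose_a, pose_b])
print(spread)
```

A small spread would flag a target where docking to the anchor centroid is likely to be accurate; what counts as small is something the talk presents graphically rather than as a number, so no threshold is assumed here.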
On the DUD database our enrichment factors are five on predicted models; the best competing one, among those we could get commercially, is about one and a half. Now I'm going to cheat, and the way I'm going to cheat is this. When you have a lower-resolution predicted structure, the placement of the side-chain heavy atoms is not great; the accuracy is quite low. So I want to ask a slightly different question: what fraction of contacts are preserved? And I'm going to ask it as two questions. Question one is: do I get the binding residues right? So maybe I've got the wrong ligand atom touching residue thirty, but it's touching residue thirty, and thirty is the correct residue. Then I'm going to ask: do I get the native contacts right? And here's where you see the big difference between the models. So this is the set: this is full coverage, this is partial coverage. This is all targets, this is the easy targets, the confident ones, and this is the not-so-confident ones. Basically what you see is that for crystal structures they are all about the same for the nonspecific residues. But what happens when we go to models? With FINDSITE-LHM we recover about sixty percent or so in the best case, for the confident predictions, in the crystal, and we only go down to forty-nine percent in the model. AutoDock and LIGIN drop from sixty percent to thirty percent, and thirty percent is random, by the way. If I simply take a ligand and spin it around its center of mass, it will touch about thirty percent of the residues by accident. So you've got to do better than random. What you can see is that the diminution in the identification of the specific binding residues, as well as of the native contacts, is much less using this approach. So it is much more fault tolerant, which is a sine qua non for using low-resolution predicted structures.
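The two questions posed above (are the binding residues right, and are the native contacts right) both reduce to a set overlap. The following is a small sketch under assumptions: residues are identified by number, a contact is a hypothetical `(ligand_atom_name, residue_number)` pair, and the example values are invented for illustration rather than taken from the talk.

```python
def fraction_recovered(predicted, native):
    """Fraction of native items (binding residues or contacts) that also
    appear in the prediction."""
    native = set(native)
    if not native:
        raise ValueError("native set must not be empty")
    return len(native & set(predicted)) / len(native)

# Invented example: a contact is (ligand atom name, residue number).
native_contacts = {("O1", 30), ("N2", 45)}
predicted_contacts = {("C5", 30), ("N2", 45)}  # wrong atom touches residue 30

# Question 1: binding residues, ignoring which ligand atom touches them.
residue_score = fraction_recovered(
    {res for _, res in predicted_contacts},
    {res for _, res in native_contacts},
)

# Question 2: native contacts, where the ligand atom has to match too.
contact_score = fraction_recovered(predicted_contacts, native_contacts)

print(residue_score, contact_score)
```

In this toy case the residue-level score is perfect while the contact-level score is only one half, which is the distinction the speaker draws with the "residue thirty" example.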
Basically it doesn't really make that much difference. And why is it working when the others are not? If we look at these models, this is the predicted all-atom RMSD, this is the binding-pocket all-atom RMSD, and this is the fraction of proteins. What you see is that most of the proteins have an RMSD bigger than two angstroms, and most detailed atomic modeling procedures deteriorate very quickly, because they depend on the shape of the envelope. Since we're using evolutionary conservation of binding poses, of substructure binding poses, it doesn't deteriorate at all. And that's why we can get away with it. The advantage of what we're doing is that we're copying from distant cousins. So now let me show you what happens when you let this thing loose on HIV protease inhibitors. We sprinkled the library with a thousand and eighty-nine known HIV protease inhibitors. I think this was a million compounds; it was two million. OK, now it's eleven. This is the ideal enrichment factor. This is random. This is the I-kill-myself regime. This is the result of FINDSITE. This is FINDSITE-LHM. This is the combined result of FINDSITE-LHM and FINDSITE; it's basically just a fusion of ranks. You look at the ranks using FINDSITE, you look at the ranks using FINDSITE-LHM, you just sum the ranks and then you re-sort them. And what you see here is that we have highly significant enrichment factors, of forty. Encouraging, but this was for HIV protease; the statistics are one. What happens when we let this thing loose on the entire human kinome? So, getting to the question of why bother: why bother using predicted models? The first point is that the human kinome is a pharmacologically very important class of proteins; after GPCRs it's number two in terms of the number of proteins targeted.
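The rank fusion described above (sum the two ranks, then re-sort) and the enrichment factor can be sketched as follows. This is an illustration under assumptions, not the group's implementation: the alphabetical tie-breaking rule, the `top_fraction` cutoff, and the compound names are hypothetical, and the talk does not say at which cutoff its enrichment factors were computed.

```python
def fuse_ranks(ranks_a, ranks_b):
    """Sum two rankings (compound -> rank, 1 = best) and re-sort.

    Both dictionaries must rank the same compounds. Ties in the summed rank
    are broken alphabetically, which is an arbitrary choice for this sketch.
    """
    total = {c: ranks_a[c] + ranks_b[c] for c in ranks_a}
    ordered = sorted(total, key=lambda c: (total[c], c))
    return {c: i + 1 for i, c in enumerate(ordered)}

def enrichment_factor(ranking, actives, top_fraction=0.01):
    """Hit rate among the top-ranked compounds divided by the hit rate in
    the whole library. A value of 1.0 corresponds to random selection."""
    n_top = max(1, int(len(ranking) * top_fraction))
    hits_top = sum(1 for c in ranking[:n_top] if c in actives)
    return (hits_top / n_top) / (len(actives) / len(ranking))

# Toy library of four compounds ranked by two methods.
method_1 = {"a": 1, "b": 2, "c": 3, "d": 4}
method_2 = {"a": 2, "b": 1, "c": 4, "d": 3}
fused = fuse_ranks(method_1, method_2)
ordered = sorted(fused, key=fused.get)
print(fused, enrichment_factor(ordered, {"a"}, top_fraction=0.25))
```

Summing ranks rather than raw scores sidesteps the problem that the two methods score on different scales.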
And Mark asked, well, why not simply use experimental structures? What I would argue is that this is as good with predicted structures as with experimental structures. This is the percentage of the kinome. This is the number of proteins, non-identical proteins, and this is the number of structures, sorted. There's a number of experimental structures: some of them have fifty structures, a lot of structures, and some of them have very few. There are only one hundred twenty kinases in the entire human kinome that are solved in the PDB. So the rest of these five-hundred-odd kinases we've got to model. And they vary in sequence identity from quite high to twenty percent for the closest guys. So this is the methodology. We take the kinome, we thread, and we do structure refinement and rebuilding of the loops and gaps. We apply FINDSITE to find binding pockets, binding residues, and ligand templates. FINDSITE-LHM gives us the anchor substructure of the ligands and an initial binding pose. Q-Dock-LHM refines the pose using a protein-ligand-specific potential which is knowledge based. And again, of course, we threw out all identical proteins or closely related ones. Actually, in the original calculations we threw away the proteins more than thirty-five percent identical, but from the point of view of doing cross-reactivity obviously we're not going to throw them out; the results don't change by much. And so what we have here: we're going to screen against the ZINC library. We're going to do ligand ranking by their fingerprints, that's what FINDSITE gives, so this is kind of a Tanimoto coefficient against a ligand profile. We combine this with the anchor coverage, what fraction of the anchor is matched, and then we also
add this pocket-specific potential, which is a potential that describes the statistical preferences in this family for a functional group X, let's say a carbonyl, to bind to lysine thirty-seven. So it could be different from lysine forty-five. And so the first question is just how crummy are the models? Some of them are very good. This is when we throw away the closely related proteins. What about these guys that are at eight angstroms RMSD? These guys have this sort of tail that's waving in the wind, away from the binding site, and it's just placed wrongly, and that's why it has a high RMSD. You can look at other things, like the TM-score. This basically tries to get rid of the problem of variable tails, and weights the most structurally similar chunks more than the structurally dissimilar chunks, so it gives you the biggest chunks that fit. The TM-score is a length-independent measure: 0.3 is random, and bigger than 0.4 is statistically significant, with a Z-score of about three. And these guys have TM-scores in the neighborhood of 0.8. These are very, very good models that were predicted. Interestingly enough, the apo structures are predicted a little bit worse than the holo, but that's just a minor detail in this game. And this just shows you that they're reasonably good models. How well do we do at finding the binding site? There are two different ways of doing it, depending upon the databases we use. This is the pocket distance: about half have a pocket distance of two angstroms or better. For the magic number, operationally, it's about ninety percent. So you can find the binding site. And this is the Matthews correlation coefficient for the binding-site residues: do you find the right ones.
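For readers unfamiliar with the Matthews correlation coefficient used here to score binding-residue predictions, the following is a minimal sketch of the standard formula applied to residue sets. The function name and the toy ten-residue protein are assumptions for illustration only.

```python
import math

def matthews_cc(predicted, native, all_residues):
    """Matthews correlation coefficient for a binding-residue prediction.

    Every residue in the protein is classified as binding or non-binding;
    the result runs from -1 (total disagreement) through 0 (no better than
    chance) to +1 (perfect agreement).
    """
    predicted, native = set(predicted), set(native)
    all_residues = set(all_residues)
    tp = len(predicted & native)                  # correctly predicted binders
    fp = len(predicted - native)                  # predicted but not native
    fn = len(native - predicted)                  # native but missed
    tn = len(all_residues - predicted - native)   # correctly left out
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return 0.0 if denom == 0 else (tp * tn - fp * fn) / denom

# Toy protein with residues 1..10; three of the four native binders are found.
print(matthews_cc({1, 2, 3, 5}, {1, 2, 3, 4}, range(1, 11)))
```

Unlike a plain recovery fraction, this measure penalizes over-prediction: calling every residue a binding residue gives a score of zero rather than a perfect one.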
The comparison is with the crystal structure; we did not put in the crystal structure answer. You see most of them have a Matthews correlation coefficient above 0.7. This is an easy family; it's a very rich family. What about ligand docking? We have different ligand pairs. This is the binding of ATP to the ATP binding pocket. This is the binding of the native ligand when we throw away the native ligand and any echo of the native ligand, so it cannot be in the database used to derive this. So if, let's say, ligand fifty-seven occurred in another kinase, it's thrown out. So these are non-identical ligands; identical ligands are excluded. This is a very complicated slide. This result is "transferred": this is if I do the best structure alignment and I simply copy the ligand, knowing the answer. So this is the theoretical upper bound of how well I'm going to do. This is the fraction of complexes. This is the whole methodology, Q-Dock-LHM, where we've basically used the FINDSITE consensus predictions of everything plus the pocket-specific potential to position it. So you see that there's a significant improvement between FINDSITE alone and using this pocket-specific potential to refine the pose to get exactly the binding residues and the specific contacts. We're doing poorer on the specific contacts than on the nonspecific contacts, but we're certainly in the neighborhood. What about ranking? If we look at ATP, this shows you the compound rank; it's ranked around one hundred. Now remember, if we just simply look at a profile, you can't improve it. It's a generic thing, essentially averaged over the family; it doesn't know which protein it is looking at. Then we add Q-Dock, which has this protein-specific thing. What you want is rank one. I want ATP to bind the ATP
binding site with very high rank, meaning a low number: I want it to be at the first position, and in a hundred fifty cases it is. You can see that in a lot of cases it's very good; in some cases it's not so good. So basically you can enrich the results; you don't do badly at ranking known ligands. But what's more interesting is that one would like to be able to predict the cross-reactivity, because one of the problems with a lot of drugs which inhibit kinases is that they inhibit the kinase of interest and a whole bunch of others. And so what we want to do is apply this methodology, which we have done: per kinase, rank order. So we have our million-compound library; rank order them. And so for the kinase I study, it says you better worry about kinase fifty: there's a probability that you're going to cross-react, because you have similar ligand binding profiles and similar other features that are going to generate a probability of cross-reactivity. We call this the CR index, which runs between zero and one. It was basically trained with a naive Bayes classifier. What I'm showing you here is the phylogenetic tree of all the kinases. So this is the root, and this is the evolutionary distance, and these are various kinase inhibitors, and this shows you the strength of binding of a molecule. Let's look at this one, CI-1033. So you see it cross-reacts with a whole bunch of guys that are evolutionarily close. Who cares; you'd expect that cross-reactivity. What about these? It's interacting with this guy, and these are evolutionarily very distant; these are far away. You want to be able to predict these cross-reactivity profiles. And so these little dots show you the strength of the cross-reactivity, the binding affinity of this molecule across the kinase family tree. So what you see is that there are a lot of cross-reactions. Can we predict them?
That's the question, because if we can predict them, at the very least we could identify which ones you want to optimize to minimize side effects, if such a thing is possible. At the very least you ought to be aware of what your inhibitor is going to interact with. If it's an essential molecule, let's say responsible for life, that you're going to inhibit in addition to your target, you might want to consider another molecule, because obviously that's not a good idea. And so if you look at the sequence identity, this would be the cross-reactivity profile organizing the kinase family just on the basis of sequence identity, and blue means they're likely to cross-react. So the dumb thing is: two proteins that are evolutionarily close to each other interact; those that are evolutionarily far away don't interact. And this is what you would get simply looking at sequence identity. This is our predicted cross-reactivity profile. Across the map you see strips of interactions. Obviously you recapitulate pieces of the diagonal, but you see a lot of stuff off the diagonal. That's the most interesting stuff: not the stuff on the diagonal, which is the family relations, but the stuff which is off the diagonal. And so basically what we have done is a very simple naive Bayes classifier which incorporates anchor conservation, pocket match score, and a correlation of ranks. Are you familiar with the Kendall tau correlation coefficient? It's a very simple thing. I have two protein targets. I have two million compounds. For each protein target I rank order all the molecules from one to N, and then I ask the question: do I tend to bind similar molecules at a similar rank?
If there is a signal, you would expect that similar molecules should be equally likely to bind if the targets are going to cross-react, and dissimilar molecules should be equally likely to be put very low in the ranking, say number eleven million out of eleven million. And so basically this is a way of combining these things, together with the geometric similarity score of the pockets, because obviously if the pockets look very different it's not likely they'd bind something similar. So it's a combination of a bunch of geometric factors and anchor factors. And what you see here is the ROC curve, with the false positive rate along this axis, for binding affinities. The line y equal to x is what you'd get if it were random, and many of these are actually pretty good, on the basis of different bioassays. OK, a few molecules: these are the ROC plots for selected inhibitors, showing how well we do compared with experiment on cross-reactivity across the family of kinases. What you see is that some of them are not so good, some of them are pretty good, and some of them are reasonably good. So it's not bad. But let me tell you about a real test. I'm very grateful to the person we believe to be Paul, who was at GlaxoSmithKline when this paper was submitted. We were made aware of a set of experiments on ligand binding profiles for kinases. There were two hundred three kinases, and there were five hundred or so, five hundred six, compounds that are supposed to represent kinase-inhibitor space, and they screened all these kinase inhibitors against all the kinases; these are experimental binding studies. And they looked for cross-reactivity. We didn't know anything about the results until they were disclosed; this was a truly wonderful blind test. And this is how you do this, the fancy way: you can construct a SAR similarity profile, which runs between zero and one, for whether two kinases tend to interact with the same compounds.
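The rank correlation the speaker describes, comparing how two protein targets rank the same compound library, is the Kendall tau coefficient. Here is a minimal sketch of the tie-free form (tau-a); the compound names and ranks are invented, and the quadratic pair loop is for illustration only, since a multi-million-compound library would need an O(n log n) implementation.

```python
from itertools import combinations

def kendall_tau(ranks_a, ranks_b):
    """Kendall tau-a between two rankings of the same compounds.

    ranks_a and ranks_b map compound -> rank for two protein targets. A pair
    of compounds is concordant if both targets order it the same way. The
    result is +1 for identical rankings and -1 for exactly reversed ones.
    Assumes no tied ranks.
    """
    compounds = sorted(ranks_a)
    concordant = discordant = 0
    for x, y in combinations(compounds, 2):
        sign = (ranks_a[x] - ranks_a[y]) * (ranks_b[x] - ranks_b[y])
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1
    n_pairs = len(compounds) * (len(compounds) - 1) // 2
    return (concordant - discordant) / n_pairs

# Two hypothetical kinase targets ranking the same four compounds.
target_1 = {"A": 1, "B": 2, "C": 3, "D": 4}
target_2 = {"A": 1, "B": 3, "C": 2, "D": 4}
print(kendall_tau(target_1, target_2))
```

A high tau between two targets means they tend to place the same compounds at similar ranks, which is one of the inputs to the cross-reactivity score described in the talk.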
I'm going to predict that if these guys have similar profiles, they're going to interact. And so the question is how well we did against experiment. This is a useless plot here. This is the correlation coefficient between the SAR similarity score and our cross-reactivity score. What you see here is that our average correlation coefficient is 0.5, with a lot of them above 0.5. There are a few we really screw up on, and it's interesting why; we don't fully understand this. But in a lot of the cases we're doing quite well. I mean, this is LCK; this is the LCK family, and this has actually got a correlation coefficient of 0.71. So what you see is that we're not doing badly at suggesting, in this family, what the off-target proteins are that, for your kinase and molecule number sixty-seven, you had better be aware of. These have a tendency to cross-react. And this is an in silico method, and obviously in silico methods are no substitute for experimental validation. But what we have at our website is that any academic can go and it will say these are the ones you ought to worry about, and what the probabilities are that they're going to cross-react. And so you can go through them and see that the average correlation coefficient with experiment is 0.5 on this extensive set of experimental data. And since we didn't train on these molecules, we didn't train on any of it, our hope is that this is actually generalizable and would be useful to the community. Let me just conclude by showing you a little bit about a resource we built, this protein structure and function prediction resource, which is actually the tip of a much larger project. Basically we have about fifteen different programs that do different sorts of things, which I'll tell you about in the end, but the idea is very simple: it should be really user friendly.
So you could take your sequence, you could punch it in, and you could do enzyme function inference. EFICAz is an algorithm developed by the late Adrian Arakaki which is very, very good; it has a very low false positive rate. In all the methods which we develop, we're going to sacrifice coverage in the interest of minimizing the number of false positives. So when we make a prediction, we want to be right. I'd rather make fewer predictions than say that everything is a histidine kinase, in which case it's a useless prediction. OK. It will also give you the predicted structure and also give you FINDSITE. The other thing you can do is, if you happen to have a structure, it can tell you if the thing is likely to bind DNA. Actually, the versions that are on there now will also do this from sequence: if you give us a sequence, we'll predict the structure and then tell you about DNA binding. And so what comes out is this: you'll get a threading summary, a set of models, FINDSITE results, and EFICAz results. It's designed to be extremely user friendly. So for example, in the threading summary, these are the Z-scores, and importantly we give the user a notion of confidence. So you don't have to know anything about this method; you don't have to know whether a Z-score of six is great or terrible, or whether sixty is great or terrible. We'll tell you it's a high-confidence prediction, and everything has been extensively validated on representative benchmarks of the entire PDB. You can then look at the models and see what the various prediction results are for the individual threading algorithms. It gives you a nice set of models; here's a model that comes out. So we build a full-length model, it gives you the results, and there is also a confidence score which tells you how likely it is that the model is right. You can then pass these to
FINDSITE; it'll give you the ligand binding pockets, how many there are, and what binds to them. This is a representative screening, and this is a Z-score of the energy with which this thing binds relative to random. Because one of the things you can do is take a ligand and shuffle it. And unlike the atomistic models, which fail this test, this one passes, because the scores actually mean something. And so I've taken you through a little piece of what's going on in our group; actually this is not even representative now of what's going on. So the idea is: you take a sequence, you thread it through a library of known structures, get an approximate model, relax the structure, and use even distantly related but evolutionarily related proteins to predict binding sites, small-molecule ligands, and the biochemical function of the protein. You can also use this for multimeric proteins; there's a multimeric version of this algorithm that allows you to predict protein-protein interactions. It suffers from a coverage problem because it's threading based, it uses existing structures, but it has a low false positive rate: less than twenty percent of the predictions are false positives. Alternatively, if you're interested in just enzyme function inference with quite high precision, EFICAz will work. So with respect to the protein function itself: FINDSITE is basically threading based, and threading means it's driven by evolution. It's very interesting how these things work. The location of the binding site is conserved. Actually, what's really cool is that in some proteins the binding site is very deeply conserved across evolution, in the same location in the structure. Sometimes you can actually watch the binding sites move; they kind of morph.
At least that is the suggestion of how this family might have evolved. This is a suggestion; this is not a temporal statement. What they tend to do is conserve anchors. Now the nice thing about this method, unlike all other competing methods out there, is that we do not require a crystal structure. For crystal structures it's at least as competitive as the all-atom methods. But for non-crystal structures we're far superior, because we're not dependent upon the little details. A lot of all-atom methods fail when you do a thing called cross-docking. A lot of them are assessed like this: I take a ligand, I rip it out, and I dock it and fling it back in. So it's just the native template: how well do I do? They tend to do poorer as the structure deteriorates. I have the same protein, but let's say it binds ligand thirty and ligand fifty, and I swap them, so I use the structure with ligand fifty to dock ligand thirty. That's called cross-docking, and they do poorer. By the time they're at three angstroms, basically the results are random. Our results are just fine. The diminution in binding-site identification is four percent, not a factor of two. And about two-thirds of the models of the target proteins will have a good binding-site prediction, and critically we know which two-thirds they are, simply on the basis of the number of templates and their behavior. You can do dumb GO annotation by basically taking the most populated: if more than fifty percent of the templates have the function, it's transferred. And that gives you, across all functional classes, for whatever this is worth, an average Matthews correlation coefficient of 0.64. Why it works is that evolution has conserved a substructure of the ligand, which is then decorated. The location in the structure of this anchor is very strongly conserved. The residues that touch it are very strongly conserved. The B-factor, the mobility, of the residues that touch it is lower, so the protein is grabbing on very tightly.
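The "dumb" GO annotation rule mentioned above (transfer a function if more than fifty percent of the templates carry it) amounts to a majority vote over template annotations. The sketch below is an illustration under assumptions: the function name and data layout are hypothetical, and the GO identifiers are used only as example labels, not as results from the talk.

```python
from collections import Counter

def transfer_functions(template_annotations, threshold=0.5):
    """Return the GO terms carried by more than `threshold` of the templates.

    template_annotations is a list with one collection of GO term
    identifiers per template identified for the target.
    """
    n_templates = len(template_annotations)
    if n_templates == 0:
        return set()
    # Count each term at most once per template.
    counts = Counter(term for terms in template_annotations for term in set(terms))
    return {term for term, c in counts.items() if c / n_templates > threshold}

# Four hypothetical templates for one target protein.
templates = [
    {"GO:0004672", "GO:0005524"},
    {"GO:0004672"},
    {"GO:0005524", "GO:0016301"},
    {"GO:0004672", "GO:0005524"},
]
print(sorted(transfer_functions(templates)))
```

Raising `threshold` trades coverage for precision, in line with the stated preference for fewer but more reliable predictions.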
And this can be used for very quick screening. We can screen millions of compounds on a desktop, because it's very quick: once you've figured out what the anchor is, the most expensive part of the calculation is the anchor matching piece. All you're going to do is bring in the equivalences, minimize a set of harmonic restraints, and do steepest-descent minimization to remove the overlaps; and then, if you want to spend a couple of minutes of computer time, you can relax the pose, and that's what we do. So it's a couple of minutes per compound, and if you have enough cores you can screen ten million compounds against the entire proteome, which is what we've gone over, for proteins up to six hundred residues. It works OK for protein models whose backbone RMSD is four to five angstroms, because that is the event horizon: for the same ligand in evolutionarily distant but related proteins, the RMSD of the binding pocket is about 2.2 angstroms plus or minus an angstrom and a half, which is exactly what we're getting from these crummy models, because that's actually what we're capturing: we're recapitulating the diversity of the binding sites. You can use low-resolution models for virtual ligand screening. The nice thing is that the ligand binding rank is not correlated with molecular weight. With a lot of potentials, if it fits, bigger will bind better, and you'll have a higher rank. That's not true here. So for example, in the DUD database, where the compounds are more or less the same molecular weight, the average enrichment factor, I think for AutoDock, was like 1.6. Ours is five. And that's on predicted structures, not on experimental structures. We built this cross-reactivity resource for the kinome, which basically used the structure predictions. You can go to our database; if you're an academic you can get the predicted binding poses of all two million compounds.
You also get their ranks. You can prioritize ligands that you like, because the ZINC database is a representative set. You can also look at cross-reactivity results, which look not only per protein target but across targets, at the binding pocket similarity. And the results are strongly correlated with experiment. So this is a new way of doing drug discovery, because not only do you design for what you want, you had better design away from everything else. And you can define things like promiscuity: you give us a molecule and we'll screen the entire human proteome, well, up to six hundred residues, and we'll tell you how promiscuous your molecule is likely to be. There are some molecules that basically bind at high rank to everything; I'd suggest that's a poor lead. Others are much more specific, and we have lists of what the off-targets are. And let's say we're even off somewhat; you have thirty of them, you don't have hundreds of them or thousands of them. And this is all available through PSiFR, which is a protein structure and function inference database. Basically this is designed for the biologist: they give it the sequence and/or structure and it'll give them function inference, ligand binding, screening, and biochemical function. It gives you the ligands' conserved and variable regions. We have a database of all structures in five hundred sequenced genomes; you can play with this. We have a database for the human proteome, and we have this two-million-compound library. And so let me just acknowledge my collaborators. Michal Brylinski is a very talented research assistant professor in my group; he and I developed the whole FINDSITE generation. Hongyi, Shashi, Adrian, and Michal were involved in PSiFR. Hongyi has been very much involved in the
structure prediction part of this exercise, which underlies it all. I didn't talk about it today, but it is basically what gives us the ability to go through the entire proteome or a significant fraction of it. And of course NIH and the Georgia Research Alliance, who paid for the people and the cluster that actually enable you to play this game. So I thank you for showing up, and I'll try to answer any questions you might have. No, it didn't make any difference. A couple of percent. I mean, unfortunately; we sure tried. But you could argue that our descriptors, even the input data, are sufficiently noisy that we've squeezed out all but a couple of percent. We tried a whole bunch of different things: SVMs, different classifiers, different architectures, different kernels. But naive Bayes works just as well. So that's the one that's up there. Yeah, yeah. Exactly, exactly. It doesn't matter much in these cases. You could do it by brute force, because we kind of know where it's going to bind, so you basically draw a sphere around that and you're pretty close to the answer. So yes, to some extent, because we're copying. I mean, sure, I wish I had a physics-based potential that worked. But I don't, and neither does anyone else. And so copying is good, because we're using evolution to tell us. But we have something which is close, which you may find very interesting because you're interested in many-body effects. It turns out that if you look at the anchor ligand, and you look at the heavy-atom functional groups of the residues that touch the anchor ligand, their orientation in space is extremely strongly conserved. And so you have a very funny situation where the backbone and side chains of the protein are actually adapting to shove this carbonyl group right over here.
So actually you can turn the problem around, and we have a paper on this that came out; it's binding-site refinement. Normally you have a protein and a ligand, and you use the protein to refine the ligand. This is exactly the opposite: we use the ligand to refine the geometry of the protein. And for the conserved guys, the average heavy-atom correlation coefficient between the score and the RMSD is 0.5. I've never seen anything like this. I suspect this is actually capturing some many-body, low-lying free energy states, and that this is in fact why it's conserved. I suspect this is actually a very general feature that you can use to build much more specific potentials, but these are inherently many-body. And I think that these could actually be used to build better potentials, certainly for ligand refinement at the heavy-atom level. It's great and it's very fast, because basically what you do is read these things off: you basically triangulate functional groups and then you adjust the rest of the protein backbone to fit. You simply relax; it's a very simple idea. They're very strongly conserved in space. This is actually probably why the B-factor is so low. We don't have time to talk about it. So this actually has a lot of very deep implications as to the nature of potentials and what's really conserved. It's almost behaving like an iceberg: very low-lying, deep free energy states that are completely missed by contemporary force fields but are in the structures. And they're very strongly conserved over the course of evolution. So you look at these things and you'll see, for example, let's say you have an aspartic acid or a glutamate. So this aspartic acid will be sitting here; the backbone is waving in the wind, and the functional group is in exactly the same location.
So by taking a ligand-centric view you can actually see what's conserved, whereas the chi-one, chi-two side-chain positions are moving all over the map. So they're literally moving to adjust this one piece that touches the ligand, rather than the other way around. So I suspect that there's a lot more work that can be done; many-body potentials are going to be extremely important. I suspect this is also true for protein residues; we haven't proven it yet. In which case having the ability to deal with many-body potentials will be much more important, because these have a lot of correlations attached to them, and I know you're interested in that. Now, yes, it's become a cottage industry. It's very interesting to watch the citations of this whole thing take off, on FINDSITE alone. OK, I came up with the idea; I guess that was true. I told Michal, I bet you that this is the region; go find it. And so Michal, being extraordinarily talented, did it in a few weeks, and it's a non-trivial problem. I give you ligands; match the ligands, find the largest common substructure. We now have some very nice versions of it in FINDSITE-LHM. If you guys have got a better ligand matching algorithm, I'd love to use it, because that's actually the most expensive part of this calculation. The problem is this: I give you a chemical moiety, and I want to match it to the anchor. It may match eighty percent, so what's the best fit of this huge ligand with this little chunk? Obviously if it's an aromatic group it's trivial, but quite often it's not so trivial; you have to have deletions and insertions, and so it's in principle an NP-hard problem. But there are numerous heuristic algorithms out there.
Yeah, so the answer is that there are now at least four or five different groups that have copied it in one way or another, probably just because that's what they're interested in doing, optimizing it. And no, they didn't. It was very interesting: Patsy [surname unclear] was ninety percent of the way there, but they didn't see it. They didn't ask the question right. I mean, this is homology modeling. This is literally homology modeling on the ligand; there's no difference. This is how I got the idea: if you remove the artificial division between ligand and protein and view it as one object, it is literally that. That's what LHM stands for, ligand homology modeling. There's no more intellectual content in it than in regular homology modeling, which is a deep idea, but it's there, and I wish it was mine, but it's not. It turns out it's very general. It's also true for proteins. It's actually true across nature. It's true for DNA-protein interactions, it's true for features of protein-protein interfaces, and, in unpublished work that we're about to submit, it's true for domain-domain interactions. An unsolved problem in structure prediction is to predict the orientation of two domains. Now, for two-domain proteins, two thirds of them undergo mostly hinge or shear, very small-scale motion, so that's a trivial one to do: a couple of angstroms. But a third of them are Pac-Man. The apo form is like this, the holo form is like that. It turns out that the holo binding pose is very strongly conserved, more so than the apo, which can be all over the map; but how it binds a common ligand in the middle is conserved over the course of evolution. It's exactly the same idea. So this idea of conservation and variation seems to be ubiquitous throughout protein structure, and so it is the most robust way of inferring that A looks like B.
I know, because this is what nature has actually done: it's discovered something and then it copies it over and over again. But there's a flip side to this, which is that the promiscuity of the molecules in cells has probably been grossly underestimated. This is why people have such a hard time designing a specific drug, because I don't think it exists: you can find the same binding sites in evolutionarily unrelated proteins all the time. There's a lot of cross-reactivity there. And the fact that this should be true is suggested by the following. If you look at human metabolism, how many metabolites are there? On the order of five thousand. Let's say I'm off by a factor of five; let's say we've only discovered twenty percent of them. That's not a lot of compound space, and yet nature has used these small molecules for life. So what is the likelihood that one molecule is used for only one signaling process? Once it works pretty well and it's been around for a billion years of evolution, nature is opportunistic. So it's an intrinsic feature of how these things work, and maybe of independent discovery, by accident. It turns out you can do it by accident: binding modes, and you certainly can do DNA binding by accident. It's just a feature of how the systems work. They're very noisy; there are a lot of cross-reactions. And this is why it's very hard to have one molecule, one drug. That's why you have those crazy curves with the kinases, because they have similar binding pockets and similar chemistry. And it's also probably true that the cell has to deal with this all the time. So there have got to be robust feedback loops and regulation, and it's probably much more like a set of conditionals that sets up something that looks like a pathway, rather than there really being one.
But that's just conjecture. That's the implication of all this stuff, though, so it has a lot of implications for how you design things. Promiscuity of protein-protein interfaces and of ligands seems to be a feature of nature. There's specificity, but there's this huge background of crap you've got to deal with to get the signal out of the noise, which is interesting. So even if I'm off by a factor of ten, it's still plenty of targets. I mean, when we go to the human proteome and look at this promiscuity, there was no case we found where we only find a very few. And OK, so our scoring functions are terrible. So let's say I'm ninety percent wrong; I hope not. There are still a lot of off-target interactions. And we've actually done this, because one of the things we're doing now is using this to predict new uses for old drugs. We found it works retrospectively on well-characterized examples like celecoxib. It is a COX inhibitor, I believe a COX-2 inhibitor. It also inhibits matrix metalloprotease, because they have the same binding site: completely different global fold, sequence identity eight percent, unrelated. Why it works is that the geometry of the binding site is very strongly conserved, as are the residues, so it's probably convergent evolution. There are a lot of cases like this, and I suspect there are more. So we are now preparing a database of all off-target interactions of all drugs. Then you've just got to associate the target with the disease, prove that it binds, and now you're in Phase II clinical trials, because it's already approved. So that's what we're trying to do now. Those are the implications of all this stuff.
Yes, we were actually one of the better groups, though not the best group, and there'll be a paper on the latest GPCR prediction from Ray Stevens.
Now remember, there's no manual manipulation, there's no literature searching, there's no experimental data in here, and on one of the targets we were among the top five; I think it was the five or six best. Human groups that study the literature and intervene manually will beat us, which they're going to do, because if you have a wealth of data you can position it better. In CASP we were in the best group in terms of template-free modeling, and depending upon how you measure, there was a whole cluster of them. My former postdoc Yang Zhang was the best, because he really is very good at these things. We were kind of in the top five percent. There's this pack, and we were in the pack. It was the previous CASP; December 2010 was the meeting. But the reality is you don't need better structures for this exercise. For ligand screening you don't need better structures. You don't even need to build a structure, after all the time spent trying to build it. You don't really need to build a 3D structure; you can use the threading alignment, because it captures the binding pockets, and that's enough. You can forget about the rest. You do almost as well, not quite as well, but it did quite well. Yeah, in CASP it did well for the binding pocket predictions. It was OK for the binding pocket predictions.
Yes. Well, I don't know, we just started; we just published a paper this fall. They've been coming out of the woodwork. We're in discussions with GSK. There's a company in Utah called SuperGen which has basically licensed the software and is putting a postdoc in our lab so they can learn how to play with it. There are a number of other companies we're in discussions with to license databases. I want people to test it so they can break it, and then we'll fix it.
I mean, or better yet, they'll test it and discover something. I'm actually starting an experimental lab, so I'm doing a lot of screening myself, because I got tired of waiting for other people to screen.
Yes, and the structure prediction part, that's the hard part. Predicting the structures that go into this is the expensive part of the prediction, because that's a couple of days per sequence. It could be a week or two weeks or twenty minutes; it depends upon the level of difficulty. Yes, for the bigger ones it takes a lot. There's a convergence issue, and we roughly know, based on detailed benchmarking, that the little ones fold very quickly, or they have very similar templates, so they finish in ten minutes. If you go on our website, there's this thing called TASSER-Lite, which is designed for sequence identity down to twenty-five percent, that will do it in a half hour for up to three hundred residues. But if you have an unrelated protein that's fifteen percent identical, or template-free, then it's going to run for a couple of weeks. No, one to one: one sequence, one core.
Yes, so now we're obviously doing lots of genomes. What's more expensive, which I didn't talk about, is the prediction of protein-protein interactions, because then we have to do the thing per target, then we've got to do this cross-reactive interaction prediction, then we've got to refine everything, and those are much bigger molecules, so that burns up a lot of computer time.
OK. Well, where computation would have the biggest impact is not the problem that I talked about today whatsoever. We're trying to simulate molecules, actually with [unclear]. Just following up on that: we've put together a kind of proto-model of E. coli, and we were looking at diffusion.
And for those things we don't have anywhere near the computational resources to do them. There are a lot of timing and scaling issues. The inversion of dense matrices, those issues are a real nightmare. We're nowhere near it, because we need to simulate sub-milliseconds for thousands if not tens of thousands of proteins, and right now we're at about hundreds of microseconds for five hundred. That's kind of what we can do, so there's a lot of algorithm development that needs to be done, and we need extreme help on that. The other thing that would be very, very useful is better approaches for dynamic programming, because the most expensive piece of this is the dynamic programming. There are two things. One is the conformational search, which we may or may not need depending upon the target. There, again, it's a search and optimization issue. We're using variants of replica exchange Monte Carlo that essentially squash the barriers in the energy landscape. It's the least worst of a bad bunch, but it's hardly perfect. For the bigger systems we can't do them; one of the reasons we can't do them is that they don't converge. Also, if you just look at the numbers of cycles, we spend a lot of time doing dynamic programming. We could use faster dynamic programming; that's just generic, because the threading is pseudo-dynamic programming. But what I really need is a non-local algorithm, because the threading algorithm includes pair interactions. How we do it now is iteratively: we basically use a sequence profile term, which is a one-body dynamic programming, to generate the partners, which we then evaluate and refine. I really need a non-local algorithm to do that. I don't have one that works. I mean, I have one, but it's so slow I can't use it on anything but a very small number.
Because I want to evaluate an alignment, but I need to know the full alignment in order to evaluate the fitness of the alignment. So unlike dynamic programming, which is local, this is non-local, and algorithms to do that could also be very useful. And then there's this whole issue where we're desperately in need of help, which is [unclear]. We have literally terabytes of data. We need people to help with that, some graduate student to help us build a database, and we need people to help us manipulate, interrogate, and present this data, because the amount of data we're going to generate is going to be extraordinarily large, and it's only getting larger, because we're doing the structure, the function, the ligand screening. Remember, we're screening eleven million compounds against all the proteins, predicting protein-protein interactions, predicting protein-ligand interactions, predicting cross-reactive interactions, and then we're going to do this for all sequenced proteins. So it's a pile; it's massive data, and I have no experience at all with databases, or with visualization and interpretation of them, or with fast searches. The scale of the problem gets quite big, so those are places where I desperately need help. A lot of this stuff actually looks mildly interesting and worth at least suggesting to people what they may want to do. It's no substitute for an experiment, but it's a good substitute for guessing. We're going to have databases that are probably, conservatively, a petabyte of data within a couple of years. And this doesn't include the trajectories, which can basically grow without bound.
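Returning to the threading step described above: the following is a minimal sketch, not the group's actual code, of the two-stage idea. A one-body sequence-profile score can be optimized exactly by standard dynamic programming (here a plain Needleman-Wunsch global alignment), whereas a pair-interaction score can only be evaluated once the full alignment is known. The profile format, the gap penalty, the mismatch default, the contact list, and both function names are assumptions made for illustration.

```python
def align_to_profile(seq, profile, gap=-2.0, default=-1.0):
    """Global alignment of a sequence to a template profile.

    profile[j] maps residue letters to a score at template position j
    (a one-body term, so dynamic programming finds the optimum).
    Returns (best_score, [(seq_index, template_index), ...]).
    """
    n, m = len(seq), len(profile)
    score = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            match = score[i - 1][j - 1] + profile[j - 1].get(seq[i - 1], default)
            score[i][j] = max(match, score[i - 1][j] + gap, score[i][j - 1] + gap)

    # Trace back from the corner to recover which residues were paired.
    pairs, i, j = [], n, m
    while i > 0 and j > 0:
        s = profile[j - 1].get(seq[i - 1], default)
        if score[i][j] == score[i - 1][j - 1] + s:
            pairs.append((i - 1, j - 1))
            i, j = i - 1, j - 1
        elif score[i][j] == score[i - 1][j] + gap:
            i -= 1
        else:
            j -= 1
    pairs.reverse()
    return score[n][m], pairs


def pair_energy(seq, pairs, contacts, pair_score):
    """Non-local term: sum pair scores over template contacts.

    It needs the complete alignment, because the residue placed at one
    template position interacts with whatever was placed at its partner.
    """
    placed = {j: seq[i] for i, j in pairs}
    total = 0.0
    for a, b in contacts:
        if a in placed and b in placed:
            total += pair_score.get(frozenset((placed[a], placed[b])), 0.0)
    return total


# Toy template with three positions; positions 0 and 2 are in contact.
profile = [{"A": 2.0}, {"C": 2.0}, {"D": 2.0}]
best, pairs = align_to_profile("AD", profile)
energy = pair_energy("AD", pairs, [(0, 2)], {frozenset(("A", "D")): -1.5})
```

In the iterative scheme sketched here, the profile-only alignment proposes partners and the pair term is scored afterwards. A truly non-local method would have to optimize both terms together, which is the part the talk says is still too slow to use.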