Cool. So thanks for having me. It's not my first time in Atlanta, but it's I think my longest visit; I stayed here over the weekend and enjoyed your beautiful city for some of its better-known features, like the Atlanta Braves and the aquarium and so forth, so that was really fun, and it's my pleasure to speak to you today on this subject. To motivate the presentation, I'll tell you first that the environment I'm in at the Broad, where we do computational chemical biology research, is one characterized by a tremendous effort in synthetic organic chemistry: making molecules that will go into our high-throughput screening collections. We have probably forty chemists, real chemists, not like me, who make such molecules and who are constantly looking for guidance on what to make. We also have quite a number of different high-throughput measurement techniques, including conventional high-throughput screening but also some profiling techniques that I'll mention in the middle portion, that provide an opportunity to inform synthetic chemistry as to how to grow a more optimal screening collection over time. I think it's pretty common for people thinking about a particular biological problem, protein, or pathway to learn how to optimize chemical matter for doing that job, and that's sort of conventional QSAR or SAR work. What we've focused on instead is how to improve the overall collection performance, and I'll go into the details of what I mean by that. In the first part of the presentation I'm going to motivate why we need synthetic compounds at all, because there's a fair community of chemists out there who would argue that nature is a sufficient source of starting points for small molecules: all you really need to do is go out to all the plants and the sponges and so forth, squeeze them, and fractionate that stuff.
And now you have all the drugs you'll need. We undertook a study to ask whether or not that was true, based on biological network connectivity of proteins. OK. So as some of you may be aware, it's been known for over a decade that biological networks, when considered by protein–protein functional connections, whether they be direct binding, enzyme–substrate relationships, or other regulatory relationships, can be viewed as networks of interactions, and within such networks there is in general a small number of highly connected proteins, or hubs, and a much larger number of less connected proteins, which we'll call leaves, with the continuum of connectivity in between. Different biological network graphs differ in their connectivity distributions, and this can be quantified. For example, here we have a high leaf-to-hub ratio, and this is a made-up network for cartoon illustration purposes, and here a lower leaf-to-hub ratio with otherwise the same number of nodes and edges, and the extent to which the distributions differ can be quantified using cumulative distributions, as illustrated here with this cartoon, plotting the number of connections against the log of the number of nodes, that is, the number of nodes with each number of connections. It's been known since two thousand seven that in humans, disease genes exhibit intermediate network connectivity. This is not surprising, because in the human, hitting a hub with a small molecule or a gene mutation is more likely to cause acute death and not to be heritable, whereas hitting these terminal nodes is less likely to make the organism sick. In between are intermediate disease genes that, because of their intermediate position in the connectivity distribution, are more likely to be inherited by offspring and therefore comprise the set of disease genes that are present in the human population. So this is a published result by others.
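Since I'll keep coming back to connectivity distributions, here's a minimal sketch of how degrees and a cumulative degree distribution can be tallied from an edge list. The toy network is made up for illustration, not drawn from STRING.

```python
from collections import Counter

def degrees(edges):
    """Count connections (degree) per node from an undirected edge list."""
    d = Counter()
    for a, b in edges:
        d[a] += 1
        d[b] += 1
    return d

def cumulative_degree_dist(degs):
    """Fraction of nodes with degree >= k, for each observed degree k."""
    n = len(degs)
    counts = Counter(degs.values())
    cum, remaining = {}, n
    for k in sorted(counts):
        cum[k] = remaining / n   # P(degree >= k)
        remaining -= counts[k]
    return cum

# Toy network: one hub (A) connected to four leaves, plus an isolated leaf pair.
edges = [("A", "B"), ("A", "C"), ("A", "D"), ("A", "E"), ("F", "G")]
degs = degrees(edges)          # A has degree 4; everything else degree 1
cum = cumulative_degree_dist(degs)
```

Plotting `cum` on a log axis gives exactly the kind of cumulative curve shown on the slide: hub-enriched sets decay more slowly to the right.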
We hypothesized that natural product targets would be hubs, because natural products very often are selected as chemical warfare agents between competitors, and you don't want to kill your neighbor later or kill its children; you want to kill it now, so you can eat its food. That's the deal. So we surmised that hubs would be enriched among the targets of natural products. To look at this question we engaged two existing databases, one called STRING, which is a publicly available database of protein–protein associations from the European Molecular Biology Laboratory, and we used it as a foundation for analyzing network connectivities. Again, some of these are direct interactions, some are enzyme–substrate, some are indirect regulatory, and STRING has a good scoring system for how it weights the evidence for these, so we just took it at its word. We mapped natural product targets onto STRING proteins using a commercially available natural product database; the vendor sells a big product with a ton of compounds in it for all different purposes, and we negotiated with them, because we didn't want to pay the bill for the whole thing. We only wanted cases where the compound was known to be a natural product and the protein target was known, and that turned out to be some eight percent of their database, and they did in fact charitably sell it to us for only ten percent of the cost of the full database. From this database we identified nine hundred forty-six unique human proteins that are known natural product targets. We also looked at the Morbid Map from NCBI, which is a list of mostly Mendelian heritable human disease genes, as well as the emerging database of genome-wide association study genes, so genes that are implicated in human disease.
Together we identified about twenty-seven hundred human proteins in the STRING database that are implicated in human disease. Our initial observation was that our hypothesis, that natural product targets represent highly connected nodes, was indeed confirmed. Shown here is a binned histogram of the number of connections per protein, for small numbers of connections, some intermediate ones, and then more than fifty. Over here is the actual cumulative distribution for the whole database, and natural product targets are in fact right-shifted in this plot, indicating they're more highly connected; all three pairwise differences between these distributions are statistically significant. All human proteins are the most left-shifted, because of the vast number of lowly connected proteins; disease genes, here in green, are intermediate; and natural product targets are the most highly connected. We next asked whether proteins targeted by synthetic small molecules, and not by natural products, have a different connectivity distribution than the natural product targets, and we find that they do. Shown here are the same green and red as before, human disease genes and natural product targets, and now, in gold, bioactive targets gotten from ChEMBL that exclude natural product targets: any protein for which we could not find a natural ligand but could find a synthetic one is shown in gold. These synthetic ligands, though, are not drugs; some of them are probes and other tool compounds. So we looked further, using DrugBank, at actual drug compounds whose targets are not natural product targets but only synthetic-compound targets, and we see a further leftward shift of the approved drug targets from DrugBank, right on top of the human disease genes. So this tells us something about the synthetic compounds that are being made, or have been made, to target human disease genes.
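The statement that all three pairwise differences are statistically significant is the kind of thing a two-sample Kolmogorov–Smirnov comparison of degree distributions supports. Here's a minimal sketch of the KS statistic itself, with invented degree samples; a real analysis would also want the p-value, which `scipy.stats.ks_2samp` supplies.

```python
import bisect

def ecdf(sample, v):
    """Empirical CDF: fraction of a sorted sample that is <= v."""
    return bisect.bisect_right(sample, v) / len(sample)

def ks_statistic(a, b):
    """Two-sample Kolmogorov-Smirnov statistic: the maximum gap
    between the two empirical CDFs, evaluated at every data point."""
    a, b = sorted(a), sorted(b)
    return max(abs(ecdf(a, v) - ecdf(b, v)) for v in a + b)

# Illustrative degree samples only: natural-product targets skew high,
# the full protein set skews low.
np_degrees  = [8, 12, 15, 20, 30, 45, 60]
all_degrees = [1, 1, 2, 2, 3, 4, 5]
d = ks_statistic(np_degrees, all_degrees)  # 1.0: fully separated distributions
```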
They are populating, or antagonizing if you will, a part of the proteome that's not being hit by natural products. So we take this as evidence that making molecules is a good idea. That's the summary statement of this first portion: don't just go out in nature and find starting points, but in fact dream up new molecules, and our chemists are quite good at this. But what they always ask us, as computational biologists and cheminformatics practitioners, is: OK, fine, so which ones should we be making? I can dream up all these molecules; which ones should we make? To articulate the challenge of this problem, I'm going to use some fun cartoon slides to illustrate a concept in cheminformatics, and I think in similarity measurement in general: similarity is a relative concept. If I show you these pictures and I ask you, are these two pictures, left and right, similar, you might say, well, what do you mean, by color or by shape? And that's a valid question. Indeed, with this pair of objects, shape looks kind of similar, color not so much. And now you're already doing something subjective: in addition to what's shown on the screen, you're comparing two things in your mind, the taste of one fruit versus that of the other, and saying, no way are these similar. So it depends not only on the descriptors you elect to use but also on subjective interpretation. And this can be further illustrated by this picture, where almost all of the similarity alarms going off in your mind are subjective and have nothing to do with the colors or shapes represented on the screen. Chemists do the same thing: they look at these molecules and make subjective assessments. Both of these are drugs, both are effective in humans, but they're quite different, even to the untrained eye.
I'm going to spend a couple of minutes talking about how in practice we can dig into this a little more, using measures of local structure versus global structure of small molecules. Chemical fingerprints are a very popular way of characterizing small molecule structure in a computer. They are typically hash-coded bit strings assigned by looking at all substructures of a certain size in the molecule and setting bits associated with each substructure being present or absent in the molecule. Because they cut the molecules up into small substructures, these measure local structure, and the familiar Tanimoto similarity measure is in fact a fingerprint-based measure. It's tailor-made for problems of this kind, where you have two bit strings, in this case representing two compounds, and you want to know the fraction of on-bits shared between the two structures. So when somebody says the Tanimoto is such-and-such, this is what it means. OK, but is that really what it means? Different fingerprints emphasize different groups. Here are Tanimoto coefficients calculated between this compound and three other similar compounds, using a particular type of fingerprint descriptor that's out there in the public domain, and you can tell just from looking at the scores and the differences between the compounds that certain changes are considered more important than other changes, based on this description. So even within the world of fingerprints, what you pick matters for what you get. Just to be specific: here the methyl group is the same and the nitrile group is different at this position, whereas here the phenyl group is the same and the naphthyl group is different. The drop in Tanimoto is much larger for the first change than for the second.
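For concreteness, the Tanimoto coefficient is just shared on-bits divided by total distinct on-bits. A sketch with hypothetical fingerprints follows; in practice the bit strings come from a cheminformatics toolkit such as RDKit, and the bit positions here are invented.

```python
def tanimoto(fp_a, fp_b):
    """Tanimoto coefficient between two fingerprints represented as
    sets of on-bit positions: |intersection| / |union|."""
    union = len(fp_a | fp_b)
    return len(fp_a & fp_b) / union if union else 1.0

# Hypothetical fingerprints: each bit position stands for a small substructure.
fp1 = {2, 5, 9, 14, 21}
fp2 = {2, 5, 9, 14, 33}   # one substructure swapped for another
sim = tanimoto(fp1, fp2)  # 4 shared bits / 6 distinct bits ~= 0.67
```

Note that a single substructure swap can move the score a lot or a little depending on how many bits that substructure touches, which is exactly the descriptor-dependence just described.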
These particular descriptors consider the methyl-to-nitrile change a more important change than the phenyl-to-naphthyl one. On the global structure side, we can look at one popular method of assigning shape to molecules, developed by Sauer and Schwarz in the early part of the last decade, called principal moments of inertia. This is essentially like finding the best ellipsoid for a small molecule, or some pose of a small molecule, but weighted by mass. It's a moment-of-inertia calculation that gives you three amplitudes, and then by taking their ratios, the short one over the long one and the middle one over the long one, you assign each compound to a position in a triangular space bounded at the corners by the canonical shapes: rod, disc, and sphere. You can conceive of a similarity measure as a distance in this space between two molecules that reflects their difference in global shape properties, as opposed to local properties like the fingerprints just described. To give you an idea of the size independence of this: if you consider this corner here, both a buckyball and adamantane would be pegged at this upper right corner, even though they're very different in size. So it's giving you global shape, but not size. If you look at a random collection of compounds, here are four hundred known bioactive small molecules in the Broad collection, using the two different descriptors, you can see what you're getting with these two different descriptions. In the case of fingerprint descriptors, you get a low average similarity between all the molecules and a poor mapping to overall shape or complexity, but very good discrimination of closely related structures, which is evidenced by these little blue blocks on the diagonal. With the PMI descriptors, in contrast, you get essentially a high average similarity for all molecules.
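The PMI calculation itself can be sketched in a few dozen lines: build the mass-weighted inertia tensor, diagonalize it, and take the two normalized ratios. The coordinates and masses below are invented stand-ins for real conformers, and a real pipeline would use numpy rather than this hand-rolled Jacobi eigen-solver.

```python
import math

def inertia_tensor(atoms):
    """Inertia tensor about the center of mass; atoms are (mass, x, y, z)."""
    m_tot = sum(m for m, *_ in atoms)
    cx = sum(m * x for m, x, y, z in atoms) / m_tot
    cy = sum(m * y for m, x, y, z in atoms) / m_tot
    cz = sum(m * z for m, x, y, z in atoms) / m_tot
    I = [[0.0] * 3 for _ in range(3)]
    for m, x, y, z in atoms:
        rx, ry, rz = x - cx, y - cy, z - cz
        I[0][0] += m * (ry * ry + rz * rz)
        I[1][1] += m * (rx * rx + rz * rz)
        I[2][2] += m * (rx * rx + ry * ry)
        I[0][1] -= m * rx * ry
        I[0][2] -= m * rx * rz
        I[1][2] -= m * ry * rz
    I[1][0], I[2][0], I[2][1] = I[0][1], I[0][2], I[1][2]
    return I

def eigvals_sym3(mat, sweeps=50):
    """Eigenvalues of a symmetric 3x3 matrix via Jacobi rotations."""
    a = [row[:] for row in mat]
    for _ in range(sweeps):
        p, q = max(((i, j) for i in range(3) for j in range(i + 1, 3)),
                   key=lambda ij: abs(a[ij[0]][ij[1]]))
        if abs(a[p][q]) < 1e-12:
            break
        theta = 0.5 * math.atan2(2 * a[p][q], a[q][q] - a[p][p])
        c, s = math.cos(theta), math.sin(theta)
        for k in range(3):                      # A <- A J
            akp, akq = a[k][p], a[k][q]
            a[k][p], a[k][q] = c * akp - s * akq, s * akp + c * akq
        for k in range(3):                      # A <- J^T A
            apk, aqk = a[p][k], a[q][k]
            a[p][k], a[q][k] = c * apk - s * aqk, s * apk + c * aqk
    return sorted(a[i][i] for i in range(3))

def pmi_ratios(atoms):
    """Normalized PMI ratios (I1/I3, I2/I3), with I1 <= I2 <= I3.
    Rod -> (0, 1); disc -> (0.5, 0.5); sphere -> (1, 1)."""
    i1, i2, i3 = eigvals_sym3(inertia_tensor(atoms))
    return i1 / i3, i2 / i3

# Hypothetical rod-like molecule: four equal masses on a line.
rod = [(12.0, 0.0, 0.0, 0.0), (12.0, 1.5, 0.0, 0.0),
       (12.0, 3.0, 0.0, 0.0), (12.0, 4.5, 0.0, 0.0)]
```

Plotting each compound's pair of ratios inside the rod/disc/sphere triangle reproduces the PMI plots on the slides.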
And despite the size independence, a pretty good mapping to shape and complexity. But as described a minute ago, molecules judged similar by this measure can have very different local structures. So what does this really mean? It means that chemists, despite asking us what they should make, really need to decide first what it is that they want. Do they want shape to be a certain thing? Do they want local structures to be a certain way? What are the properties they consider important? I'll come back to how we can guide that decision making in the last part. Is everybody with me? Any questions before I go on? OK, cool. So, to start to get at how some of these properties are similar or different between different sources of compounds, natural products, compounds that are in typical pharmaceutical screening collections, and those coming from the academic synthetic chemistry community, we characterized about fifteen thousand compounds that at the time had no previously known biological activities: commercial compounds, here called CC, diverse compounds, here called DC, and natural products, NP. We just want to know: can we quantify what chemists tell us they intuitively think is different between these compounds? This slide shows just five or six randomly picked examples from each group that give a sense of what the differences are. The commercial compounds in general are much more flat, aromatic compounds without a lot of side-chain variability. The natural products are pretty hairy: lots of stereocenters, kind of bigger, more complex. And the synthetic compounds from academic chemists are kind of intermediate; they have some features of each, and as we'll see, that's borne out by more quantitative studies. Two things came up that chemists told us about.
They kept saying, we want to make things that are more natural-product-like, and I don't really know what that means, because natural selection makes natural products and there's no way to emulate that in the laboratory. What you could do is make compounds that have features reminiscent of natural products, or similar to natural products, but it's not clear that that will automatically confer the other properties natural products have, to wit: somewhere in the ecology of Earth, there's a target for every natural product; otherwise nature wouldn't bother to make it. It's too expensive to make compounds, even for nature. So, two things our chemists were interested in knowing were whether the stereochemical complexity of their compounds, or the shape complexity of their compounds, as judged by the two metrics shown here, were approaching those of natural products. We quantified this in a number of ways, one of which is shown here: the total fraction of carbons that are stereogenic centers, chiral centers, and the total fraction of sp2-or-sp3 carbons that are sp3. Both of these vary from zero to one and provide a continuum in what we observe between the commercial compounds, which are much more flat and less complex by both measures, and natural products, which are much more complex by both measures, and they show that the compounds from academic chemists are in general intermediate. As judged by their phrase "natural-product-like," they're making what they think they want. We can take this further and look at properties that, say, medicinal chemists care about. Here are several properties that underlie Lipinski's rule of five and related rules for filtering compounds: molecular weight, logP, and so on, and we see something interesting in four of the six cases.
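Both complexity metrics are simple ratios once you have per-carbon annotations. A minimal sketch, with invented example molecules; real annotations would come from a cheminformatics toolkit, not hand-written tuples.

```python
def complexity_metrics(carbons):
    """Compute the two complexity metrics from a list of per-carbon
    annotations (is_stereocenter, hybridization in {'sp', 'sp2', 'sp3'}).
    Returns (fraction of carbons that are stereogenic,
             fraction of sp2-or-sp3 carbons that are sp3)."""
    n = len(carbons)
    stereo_frac = sum(1 for st, _ in carbons if st) / n
    sp23 = [h for _, h in carbons if h in ('sp2', 'sp3')]
    fsp3 = sum(1 for h in sp23 if h == 'sp3') / len(sp23)
    return stereo_frac, fsp3

# Hypothetical flat aromatic compound: six sp2 carbons, no stereocenters.
flat = [(False, 'sp2')] * 6
# Hypothetical natural-product-like fragment: ten carbons, four stereocenters.
hairy = [(True, 'sp3')] * 4 + [(False, 'sp3')] * 4 + [(False, 'sp2')] * 2
```

The flat compound scores (0, 0); the hairy one scores (0.4, 0.8), illustrating the CC-to-NP continuum described above.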
The diverse compounds are in fact intermediate in their values, and for the two cases where they fall outside the CC-to-NP range, they're not far outside it and are still for the most part within acceptable ranges, although these are a little bit heavy by Lipinski's rules. Chemical diversity is really relative, just as similarity is, and I'll illustrate that with three different ways of characterizing these compound sets. If you take the six properties on the previous slide, do principal component analysis, and then show, as I've done here, the top two principal components, which in this case account for about ninety-three percent of the diversity of this particular compound collection, you can see that the compounds from nature, the green ones, spread out in a direction different from the compounds from the synthetic chemists, and this map of the factors onto the space illustrates how they're accomplishing that diversity. Nature is accomplishing its diversity by varying the number of hydrogen-bond donors and acceptors and the polar surface area, while doing less variation of rotatable bonds and size, whereas most of the diversity of DC comes from changes in size. If I look simply at atomic composition, we get another picture of what chemists are doing versus what nature is doing. Both of the synthetic classes of compounds, the flatter typical commercial compounds and the academic synthetic chemistry compounds, vary nitrogen composition, and to a lesser extent carbon composition, to get most of their variability, whereas their variation in oxygen composition is much more suppressed. Nature, on the other hand, does a lot of its variation by changing the oxygen composition, so the number of oxygens per molecule is much more variable in that set. You can do the same trick with other descriptors.
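The PCA step can be sketched compactly: standardize each property column, then extract the leading component. The property values below are invented, and a real analysis would use numpy or scikit-learn; this pure-Python power iteration recovers only the first component.

```python
import math

def standardize(rows):
    """Column-wise z-scoring so each property is on a comparable scale."""
    out_cols = []
    for col in zip(*rows):
        mu = sum(col) / len(col)
        sd = math.sqrt(sum((v - mu) ** 2 for v in col) / len(col)) or 1.0
        out_cols.append([(v - mu) / sd for v in col])
    return [list(r) for r in zip(*out_cols)]

def first_pc(rows, iters=200):
    """Leading principal component via power iteration on the covariance
    matrix of the standardized data. Returns (loadings, variance along PC1)."""
    X = standardize(rows)
    n, p = len(X), len(X[0])
    cov = [[sum(X[k][i] * X[k][j] for k in range(n)) / n
            for j in range(p)] for i in range(p)]
    v = [1.0] * p
    for _ in range(iters):
        w = [sum(cov[i][j] * v[j] for j in range(p)) for i in range(p)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    var = sum(v[i] * sum(cov[i][j] * v[j] for j in range(p)) for i in range(p))
    return v, var

# Two perfectly correlated made-up properties: PC1 captures all the variance.
rows = [[1, 2], [2, 4], [3, 6], [4, 8]]
loadings, var = first_pc(rows)
```

The fraction of variance explained is `var` divided by the trace of the covariance matrix (here 2.0, so PC1 explains 100%), mirroring the "ninety-three percent" figure quoted for the real six-property analysis.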
With any descriptors you like, really. I show here aromatic rings versus side chains because it sort of maps onto these again: changing the number of rings and aromatic rings is a big driver for the diversity of the synthetic compounds, and changing the number of side chains for the natural products. So this gives us some clue of what it really might mean to be natural-product-like, in their parlance. Something chemists have become increasingly interested in over the past two or three years is this notion of escaping flatland. Earlier studies by us and others have indicated that stereochemical complexity, as I'll show in a few minutes, is in fact better for outcomes in biology, whether as drugs or probes; I'll come to our evidence for that shortly. The paper by Lovering was entitled "Escape from Flatland," and it took this to clinical compounds and addressed the extent to which more stereogenic compounds survive longer during the process of clinical trials. What we see by assessing with the principal moments of inertia is that the efforts chemists are making to produce more globular compounds, that is, those nearer the sphere corner, are in fact working; they're shifting the distribution of the blue ones up that way relative to natural products and commercial compounds. The data here are shown as cumulative distributions from either this corner or from the middle of this side in these two plots, and you can see that the DC are more spherical than either NP or CC. So this is what chemists have decided over the last couple of years they really want to do. I don't know whether it's the right thing, but we know that, given that goal, we can guide them toward it. I'll give you two quick examples. We published a study with a chemist that retrospectively considered the relative roles, for shape, of stereochemical diversity within any of the skeletons versus skeletal diversity within the branched reaction pathway.
We were able to quantify the cases in which the difference between a precursor skeleton and its antecedent skeleton was more or less important than the difference in diastereomers at each of those positions. Again, this was done retrospectively; the chemists in this case did not have screening data at the time to support which of these was better. I'll come to that in the last part of the talk. But it shows that we can quantify these choices they're making, in terms of the relative importance of different types of diversity. A more global library decision-making process can also be measured with respect to its effect on shape. Here's another chemistry collaboration we engaged in, in which two different libraries were made, one published in two thousand eight and one in two thousand ten, where essentially only one intermediate step of the library synthesis was changed, and we wanted to look at the consequences for the shape of the old versus the new library. We were able to quantify that, again in terms of distance from the sphere-like corner, and make the statistical statement that indeed the blue library is more spherical in nature than the red. That may be true, but no one has shown me to my satisfaction that more spherical is better, and I'm in the business of showing that. So this gets at the last half of the talk: are our chemists really picking, on their own, the right desired property to go after? It's clear that if they have an idea what they should go after, we can help them get there; that's kind of a straight-up computational exercise. But how can we help them pick the good ones? So I'm going to take a slight diversion into what is sort of another area of research in my group, to tell you a story that I think importantly sets up the last part of the talk, the last five minutes; I'll come back to this.
What inspires me is the following: all of the cheminformatic stuff that you've seen is about taking compound structures, ones you may not have made before, and computing something about them in advance of their synthesis. That's really important, but it doesn't really trump what happens when you actually make a molecule and go and measure something, or some things, as we'll see, about it in the laboratory. So to me, a completely alternative representation of chemical similarity is performance similarity, and that's what I'll talk about in this third section. Anyone have any questions yet? They told me to wait till the end, but I can't; I like questions too much. [Audience question, inaudible.] No, we didn't consider that. As I'll show in this section, we actually do experiments with one hundred proteins, and we treat all the proteins equally in that study. But you could do that; I can even imagine clever ways to use some measure of complexity, or binding-site complexity, to weight them. In the things you'll see me present, each protein is considered as good as any other; we haven't done that weighting. It's a useful thing to think about for future work and collaboration, particularly with people who think about protein structure complexity a lot, which we don't. But that's definitely the subject of this section. So: we make a measurement. We take a small molecule, we put it on some biology, and we make a measurement, and ideally on the same day, at the same time, we take DMSO vehicle and put it on the same biology, ideally a bunch of times. Then we ask: how likely is the well that had the compound to be a member of the mock-treatment distribution? That's kind of a score for whether the compound's presence was required to observe the effect, like a probabilistic measurement of success. And we do that in a bunch of contexts, which I'm about to go through.
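A toy version of that score is an empirical two-sided p-value against the plate's DMSO wells. The well values below are invented, and the real pipeline accumulates evidence over replicates in a more principled way than this single-well sketch.

```python
import statistics

def mock_percentile(treated, mocks):
    """Fraction of mock (DMSO) wells at least as far from the mock mean as
    the compound-treated well: a crude score for 'did the compound matter?'
    0.0 means the treated well is outside everything the mocks did."""
    mu = statistics.mean(mocks)
    dev = abs(treated - mu)
    return sum(1 for m in mocks if abs(m - mu) >= dev) / len(mocks)

mocks = [0.90, 1.00, 1.10, 1.00, 0.95, 1.05]   # hypothetical vehicle wells
print(mock_percentile(2.00, mocks))  # 0.0 -> clearly not a mock-like well
print(mock_percentile(1.00, mocks))  # 1.0 -> indistinguishable from mocks
```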
All right, so here's the idea; this is the generic slide, and I'll show you several examples. This is what I just said: we take a compound, we take some biology, and we test whether the compound did anything. If we do that a lot of different times with a lot of different biologies, we can get a vector of measurements that come from all these different assays, whatever they are. This is not a new idea. The NCI pioneered it twenty-five years ago with the NCI cell-line sensitivity profiling experiments, in which they took compounds and exposed them to sixty different cancer cell lines, took a second compound and did the same, and articulated a level of similarity between the compounds based on their differential sensitivity across these lines. Other work has built on these kinds of sensitivity profiles, using them to do things like connecting compounds and targets. Some of their original work, which we reviewed in two thousand nine, involved sixty cell lines by four thousand small molecules, and they had also measured protein levels. So these are cell-killing experiments, and these are protein levels for those same sixty lines against some number of protein targets, and they did a nice matrix-algebra thing to connect small molecules with targets through the common intermediary of the sixty cell lines. So this is good; this is the grandfather of all profiling as far as I'm concerned. You'll hear people doing the kinds of profiling I'm about to talk about forget about this, but this is the beginning. Profiling has also been done using gene expression data; this is probably the kind of profiling most familiar to most people. The two studies that were most relevant to small molecule perturbation profiling are the compendium in yeast done by Rosetta in two thousand, and the work at the Broad on the Connectivity Map, where small molecules impinge on a cell system, you measure a gene expression profile, and you use that to make connections between different compounds, between induced states of cells, say by RNAi or whatever, and between different cell lines, whatever it may be. The Connectivity Map, which some of you may have read about, is the sort of web tool that enables this functional annotation. Profiling can also be done with image features: high-content screening assays, where you impinge on a cellular system with small molecules and measure a bunch of different features by microscopy, with different fluorescent dyes or whatever, can be used as the basis to form profiles, and this can connect small molecule sensitivity of different phenotypes with a compound. CellProfiler is an example of this that was also developed at the Broad. These last two are examples of datasets to which we have access and think about. The one thing my group has developed over the last couple of years is assembling profiles over many diverse assays. In the examples I told you about, the investigators set out to build a profile, by taking sixty cell lines and doing something, or by running gene expression. We built ChemBank in part to enable assay performance profiling across many different assays that were not done with the idea of building profiles; they were done because each biologist wanted to discover a pinch of magic dust for their own biology. So we just needed a way to reconcile the scoring between these, to put them onto a common footing. And we've used this kind of idea, like the other profiling you've seen, to enable lead-hopping and target identification studies; I'll come back to one of those in a minute. I'm going to spend three slides on how we do it, which gets back to your question. So: how we score.
The one thing you can be sure to have in any small molecule perturbation experiment is the corresponding vehicle control; more of them is better, and replicates of the compound treatment are also better. So we handle replicates of the controls, which we try to do a lot of times, and of the treatments, which we try to do as many times as we can afford, usually two or three, in a principled way, so that the evidence is accumulated over the appropriate number of mocks that share a plate with the compound. We then transform the data to emphasize the decision boundaries. These scores are kind of like Z-scores on their own, so for a particular confidence of where the compound is scoring, we suppress differences down in the noise, where it's all noise anyway, and we also suppress differences at the extremes: after a certain point it doesn't matter anymore; at 0.99999999 we don't care anymore. So we transform, using this double sigmoid, into a regime where we're emphasizing the decision boundaries, where it really matters. Then we normalize to account for missing data, using Fisher's z, to take pairs of compounds that have shared different numbers of assays and normalize their similarity scores, so that for any given compound we can ask: which assays has it seen in common with any other compound, and did it perform similarly? And the answer doesn't depend on the number of shared assays being different. So this is now a fourth novel profiling type that we developed, and it can give rise, like the others, to a pairwise similarity between any collection of compounds. So again, here's the summary, going back through these various types: no matter what you're doing, you end up with a measurement vector that you can use as the basis for similarity and diversity between compounds. Instead of a cheminformatic profile, we now have a measurement vector, an assay performance profile, for each compound.
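Here are two toy pieces of that pipeline: a double-sigmoid squashing of z-like scores, and a similarity computed only over the assays two compounds actually share. The band cutoffs and steepness are invented for illustration, and the real normalization for differing assay overlap is more careful than a plain masked correlation.

```python
import math

def double_sigmoid(z, lo=2.0, hi=6.0, steep=2.0):
    """Squash a z-like score: magnitudes inside the noise band (|z| < lo)
    are pushed toward 0, magnitudes beyond hi saturate near +/-1, and the
    transition between lo and hi is where differences are emphasized."""
    mid = (lo + hi) / 2.0
    mag = 1.0 / (1.0 + math.exp(-steep * (abs(z) - mid)))
    return math.copysign(mag, z)

def profile_similarity(p, q):
    """Pearson correlation over assays both compounds were tested in;
    None marks a missing measurement. Returns None if fewer than two
    shared assays exist."""
    pairs = [(a, b) for a, b in zip(p, q) if a is not None and b is not None]
    n = len(pairs)
    if n < 2:
        return None
    xs, ys = zip(*pairs)
    mx, my = sum(xs) / n, sum(ys) / n
    num = sum((x - mx) * (y - my) for x, y in pairs)
    den = math.sqrt(sum((x - mx) ** 2 for x in xs) *
                    sum((y - my) ** 2 for y in ys))
    return num / den if den else 0.0
```

With these, a noise-level score (z near 0) transforms to nearly 0, a strong hit (|z| of 8) to nearly ±1, and two compounds are compared only where their assay histories overlap.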
And this can support target identification. I'll give you an anecdote of how that works, from something we published last year, as just one example. We take the sparse data in ChemBank: a bunch of assays, up to a thousand, were run on a bunch of compounds, but not every compound saw every assay. In a separate assay, we had identified a compound that induces insulin expression in pancreatic alpha cells; this is part of our diabetes biology group's work. They came to us and said, we just found this compound, we want to know what it does; can you tell us what compounds in the whole historical dataset perform similarly to it? Several compounds had similar profiles according to our threshold on the similarity metric, in fact eleven, and they were similar enough that we were willing to tell them about them. Nine of those were unannotated, but two were previously known to be kinase inhibitors, and that immediately gives us the hypothesis that our new compound is a kinase inhibitor. So we sent it out for kinase inhibition profiling and indeed confirmed that it is an inhibitor. It's not super selective, but it is a kinase inhibitor, and its inhibitory characteristics are enriched for particular parts of the kinome. So it's not a broad-spectrum inhibitor; it's not confined to one family either, but it mostly goes to two-ish families, and they're still investigating what this compound really does in alpha cells that makes this work. So this is an example where profile similarity can support target identification; that's the similarity version of performance profiling. Another thing we're doing, and this is more of a heads-up for the future, because I think this dataset is going to be valuable to the community: later this year we'll be releasing a dataset where we did all three of gene expression, cell morphology, and cross-assay profiles for three sets of compounds.
Twenty thousand from our own internal synthetic chemistry collection; ten thousand from the national screening collection, the MLPCN collection; and three thousand known drugs and tool compounds. And so by relating similarity relationships among these, people will be able to go and find novel chemical matter that does something similar to a compound they've studied, or to a chemically similar compound to something they've studied. For the middle set, we picked the compounds to be most representative of the active subset of the national collection. I won't talk more about this today, but you can look for the dataset to come out later this year or early next year. OK, so that's performance as an alternative basis for small-molecule similarity. We've talked about chemical similarity and diversity, and now we've talked about performance-based small-molecule similarity. The last bit I want to talk about is small-molecule performance diversity, and how we can use it to guide synthetic chemistry, coming back to the idea of whether we can actually tell chemists what they should be making. So remember this slide; this is where we left off that little segment: chemists may or may not be picking the right properties, so how can we get at this? Remember, we looked at structural properties for these three classes of compounds. We also did a performance analysis of these compounds using small-molecule microarrays as an assay platform: we print compounds on glass slides, come in with a labeled protein, wash, and then use a labeled secondary antibody to see what's bound. So we have a matrix of fifteen thousand compounds by one hundred proteins that represents primary putative binding interactions. This is what those look like; this is just a binary heat map showing the composition of the groups and then the one hundred proteins. One of the things we looked at is whether the overall hit rates were different among these groups.
And we did find that the diverse synthetic chemistry compounds had a slightly higher hit rate than the commercial compounds, and both were much higher, maybe ten-fold, than the natural products. Which makes sense, because natural products are exquisitely selective for something in their ecology, and we only picked one hundred proteins, from humans, so we shouldn't really expect to have found the right target for each of them. We found that when we look across the hundred proteins and ask whether individual compounds hit exactly one, two to five, or six or more of the proteins, the diverse and natural compounds were enriched, over-represented, for highly selective binders, or in the case of natural products both the highly selective and the medium-selectivity binders, whereas the commercial compounds were enriched for nonselective binders and under-represented for selective binders. That was just using compounds that bound at least one protein. If we look at all the compounds again, hit and non-hit, regardless of what source they came from, natural or otherwise, but instead map onto our stereogenic ratio, compounds with no stereogenic centers, compounds with up to a quarter of the carbon atoms as stereocenters, and compounds with more than that, we see that the most complex compounds, which again are enriched in natural products, are enriched for non-binders; the simplest compounds are enriched for promiscuous binders; and the intermediate complexity favors selective binders. More recently, we also analyzed the same data set with respect to the PMI plots that I showed you a couple of times, and we do find significant enrichments for specific behaviors: sphere-like compounds are significantly enriched among the selective binders, and rod- and disc-like compounds among the more promiscuous binders. OK, so this is all great, right? Our chemists love this.
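The selectivity binning just described can be sketched as a simple count over the binary hit matrix; the cut points (exactly 1, 2 to 5, 6 or more) are from the talk, everything else is illustrative:

```python
import numpy as np

def selectivity_bins(hits):
    """Bin compounds by how many of the profiled proteins they bind.
    `hits` is a binary (n_compounds x n_proteins) array; the cut
    points follow the talk: selective (1), medium (2-5),
    promiscuous (>=6)."""
    counts = hits.sum(axis=1)
    return {
        "non-binder":  int((counts == 0).sum()),
        "selective":   int((counts == 1).sum()),
        "medium":      int(((counts >= 2) & (counts <= 5)).sum()),
        "promiscuous": int((counts >= 6).sum()),
    }
```

Computing these bins separately for each compound source (diverse synthetic, commercial, natural product) and comparing against the pooled rates is what gives the enrichment and under-representation statements above.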
They were like, this means our compounds are the greatest thing, because here's what happened: medium hit rate and low selectivity for these typical commercial compounds; hit rates that are too low for the natural products, even though they have high selectivity; and our compounds are good at both. But this doesn't address an important distinction, so let me do a quick thought experiment, or two, depending on how you count. You have one hundred protein-binding assays, right? Suppose you find one hundred specific compounds, each of which binds exactly one protein. There are two extreme cases: all one hundred compounds bind the same one protein, versus each compound binds a different protein. From the standpoint of a screening collection, if you like that one protein and you care about it, you're happy as a clam. But from the standpoint of building a better screening collection, the second outcome is clearly preferred. And so we wanted a way to get at this distinction. So what's the mechanism we're going to use to do that? We turned, probably not surprisingly to some of you, to the Shannon information entropy as a measure of outcome diversity. In the single-assay case, you can imagine that for a binary hit/non-hit assay you would get the highest entropy when half your compounds were hits, and we'll come back to that in a minute, because that's probably not a good situation. We're using this in more of a joint-entropy sense, for profiles. Quickly, for those of you who may be less familiar, the intuitive idea here is that when you have different states that are symmetrically disposed, the entropy of the thing, a coin, a die, or whatever, is simply the log, in whatever base you're using (I'm using base two), of the number of different states: log of fifty-two for a deck of cards, and so on.
In the unfair case, there's a distribution of outcomes depending on how unfair it is, and it's correspondingly more complex when multiple states are possible. Joint entropy is defined here for performance profiles as follows: now we've got a bunch of basically unfair coins that are separate from each other, and what we're computing is the probability of getting a particular pattern across all one hundred proteins versus the other patterns. So this joint entropy measures how uniformly the states are distributed. We can walk through this in the context of screening experiments with a toy example. Consider three binary assay readouts; there are eight things that can happen. If you had one compound that did each of these eight things, you'd maximize the joint entropy; indeed, any uniform distribution of compounds across the states gives that answer. If you're missing states, you pay a cost: certain patterns didn't happen. If you dilute with inactives, which is now getting closer to our real situation, where most compounds don't do anything, you get another penalty. So this performs well on our thought experiment, in the sense that if all one hundred compounds bind the same one protein, you get essentially no information, whereas in the case where each binds a different protein, you maximize it. OK, so there are two issues with applying this to our real data. The first is that this entropy doesn't care how much you prefer a given state over another. These two examples have the same joint entropy, but one reflects a situation where the specific states, one-zero-zero, zero-one-zero, and zero-zero-one, have the outcomes in them; if you put the same distribution of outcomes on states that are less specific, like this promiscuous compound, you get the same answer. So we need some way to deal with that.
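A minimal sketch of the joint-entropy calculation over binary performance profiles, which reproduces the thought experiment: everything binding the same protein gives zero entropy, a uniform spread over all patterns maximizes it, and diluting with inactives pays a penalty:

```python
import numpy as np
from collections import Counter

def joint_entropy(profiles):
    """Joint entropy in bits of a set of performance profiles: each
    compound's full hit pattern across the assays is one outcome, and
    we measure how uniformly the observed patterns are distributed."""
    counts = Counter(map(tuple, profiles))
    n = len(profiles)
    p = np.array([c / n for c in counts.values()])
    return float(-(p * np.log2(p)).sum())
```

Three binary readouts give eight possible patterns; one compound per pattern yields log2(8) = 3 bits, while a collection where every compound shows the same pattern yields 0 bits regardless of which pattern it is.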
So far, we have addressed this by scoring separately the sets of profiles with the same number of non-zero entries and then computing a weighted sum that emphasizes contributions from the more specific states. The other issue has to do with missing data. Of course, entropy is perfectly happy to reward you for having missing data, which it should do, because you haven't measured it yet, so you don't actually know whether something's going to happen or not; your level of uncertainty is higher in advance of making a measurement. We haven't fully addressed this yet; we simply show both cases, missing data censored to zero (inactive) and missing data not censored, which gives a higher answer. So, coming back to our one hundred protein-binding assays, our observation is that while DC did have more enrichment for specificity, it had lower joint entropy than CC, which means that DC, in these assays with this particular hundred proteins, was concentrating its effects on a subset of proteins, or patterns of proteins, relative to CC, which was accessing more different patterns. To see how general this was, we looked at the exact same compounds across all the data known for these compounds in ChemBank. This is functional assay data. It's a sparse matrix, because different compounds have seen different numbers of assays; each compound here has seen at least fifty-one assays and no more than one hundred fifty-four. Here's the case where untested is considered different from inactive, the sort of inflated case that I described, and again DC is slightly lower. And the relative result among these three classes doesn't really change when you set untested equal to inactive. OK. So this means that, despite the excitement about the diverse compounds being more specific and having a higher hit rate, it's not fair to say that they're a better collection, as a collection,
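One way to realize the "score profiles with the same number of hits separately, then weight toward the specific states" idea is sketched below; the 1/k weight is my assumption for illustration, not necessarily the weighting actually used:

```python
import numpy as np
from collections import Counter, defaultdict

def entropy_bits(patterns):
    """Entropy in bits of a list of hashable outcome patterns."""
    n = len(patterns)
    p = np.array([c / n for c in Counter(patterns).values()])
    return float(-(p * np.log2(p)).sum())

def weighted_joint_entropy(profiles, weight=lambda k: 1.0 / k):
    """Group profiles by their hit count k, score each group's entropy
    separately, and combine with weights that favor small k (the more
    specific patterns). The 1/k weight is an assumed choice."""
    groups = defaultdict(list)
    for row in profiles:
        k = int(sum(row))
        if k > 0:  # all-zero profiles contribute no hits
            groups[k].append(tuple(row))
    total = sum(len(g) for g in groups.values())
    return sum(weight(k) * (len(g) / total) * entropy_bits(g)
               for k, g in groups.items())
```

This fixes the problem in the toy example: a collection of three distinct single-hit patterns and a collection of three distinct double-hit patterns have identical plain joint entropy, but the weighted score ranks the specific collection higher.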
necessarily, than CC in these particular assays. So, last piece: how can we get this back to what the chemists should be making? Here, what we've done is to take the joint-entropy notion on subsets of compounds of equal size, centered around different positions in the molecular-weight distribution. So we can ask: does it matter whether compounds are lighter or heavier, with respect to the question of whether they perform better as a group, as judged by joint entropy? This is just one example, where I've taken a window of a little over a thousand compounds, slid it along the distribution, and computed either the unweighted or the specificity-weighted joint entropy for that group of compounds in these same assays, relative to random samples of the same size. To give you a little bit of an idea, the standard deviation on the random samples is about 0.1, so you can consider heights inside of that range as maybe not that meaningful. But what this shows is that, unweighted, going to heavier compounds actually helps you with performance diversity; whereas if you weight by specificity, that is, emphasize the more specific cases, you're actually doing worse. And that was true for the... I'm sorry, this label should say functional data, not specificity-weighted; that's a copy-paste mistake. It's true more for the cellular assays than for our protein-binding assays, which kind of makes sense in terms of compounds getting into cells more easily when they're bigger but then having nonspecific effects; I don't know if that's what's happening, but it's consistent. So if you imagine chemists now making molecules and wanting to know, as a collection, should I be aiming down here in the molecular-weight range or up here,
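The sliding-window analysis can be sketched as follows; the window size and number of null samples are illustrative choices, not the talk's exact settings:

```python
import numpy as np
from collections import Counter

def joint_entropy(profiles):
    """Joint entropy in bits of a set of binary performance profiles."""
    n = len(profiles)
    p = np.array([c / n for c in Counter(map(tuple, profiles)).values()])
    return float(-(p * np.log2(p)).sum())

def entropy_vs_property(prop, profiles, window=1000, n_null=20, seed=0):
    """Slide a fixed-size window along compounds sorted by a property
    (e.g. molecular weight) and report each window's joint entropy
    minus the mean entropy of random samples of the same size (the
    null). Returns (mean property value, excess entropy) per window."""
    rng = np.random.default_rng(seed)
    order = np.argsort(prop)
    out = []
    for start in range(0, len(order) - window + 1, window):
        idx = order[start:start + window]
        null = np.mean([
            joint_entropy(profiles[rng.choice(len(profiles), window,
                                              replace=False)])
            for _ in range(n_null)])
        out.append((float(np.mean(prop[idx])),
                    joint_entropy(profiles[idx]) - null))
    return out
```

The excess over the random-sample baseline is what makes windows at different molecular weights comparable; the spread of the null (about 0.1 in the talk's data) tells you which excursions are meaningful.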
we can give them guidance now on what prior collections have done when we view performance diversity as a function of molecular weight. And we can look at that also for other properties, like calculated logP, which shows that until you get to a certain point, at least in the unweighted case, you're really performing relatively poorly; in the weighted case that's a muted but similar trend. We can do this for hybridization ratio; we can do it for distance from the sphere-like corner of the PMI plot; for any descriptor that we can quantify and use to group compounds along a chemical-property axis, we can ask whether or not it helped to be at a particular position along that axis. So this is how we'll be engaging chemists in the future when we talk about prior performance of compound collections, and how they might tune their libraries to have properties that fall into certain ranges based on these kinds of analyses. OK, so that's it. I think I've covered that new sources of compounds are needed for disease biology; that our chemists want to learn which new compounds to make; and that small-molecule performance can be used not only as an alternative measure of small-molecule similarity, but also as a means to guide synthetic chemistry activities, by viewing performance diversity for collections. I believe that's it. To end, I'll thank my team, which comprises a very talented set of about half-and-half postdocs and staff scientists at the Broad; we have and have had some very nice visiting scholars and summer students who contributed to this work; as well as our collaborators in the rest of the chemical biology program and platform at the Broad, the cancer program, and the imaging platform; and obviously NIH and the other disease foundations for funding. Thank you for your attention, and hopefully we still have time for questions. [Question from the audience.] It's a great question. So for these ones, where we were working
with a chemist directly, on a really small library we made: for these ones, we made several conformers and actually had Gianni and Daniela stare at them and say, "this is the one I believe." That chemistry effort was small enough to do it that way. No, we showed them several of each; we gave them like eight minimum-energy conformers from a piece of software and let them decide, "yeah, I really believe this one," because they had been doing exploratory NMR and X-ray during their synthesis. They didn't have that data for every compound, so we didn't want to use X-ray for some and not others, but they had a good sense for how these guys were folding up. So we did that in this case. In the case of the large collection, where we're looking at fifteen thousand compounds, we asked for up to thirty-two low-energy conformers from the ChemAxon plug-in in Pipeline Pilot, kept those within three of the lowest energy, then took the median values of the three PMI coordinates and used the ratio of medians to make these plots, because there's no way we were going to inspect this many by eye. [Audience question, largely inaudible, about describing molecules by features beyond shape.] That's certainly possible. So there are a couple of avenues here. I think one that's been explored in the literature, and a little bit by us playing around, is charge-density moments instead of mass moments, and that gets you part of the way there.
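The conformer-handling recipe just described can be sketched as follows; the `(energy, moments)` input format is a hypothetical stand-in for whatever the conformer generator returns, and the energy window of "three" is assumed to be in the generator's energy units:

```python
import numpy as np

def pmi_ratios(conformers, energy_window=3.0, max_conf=32):
    """Sketch of the described procedure: take up to `max_conf`
    low-energy conformers, keep those within `energy_window` of the
    minimum, take the median of each sorted principal moment of
    inertia (I1 <= I2 <= I3), and return the normalized PMI-plot
    coordinates (I1/I3, I2/I3) from the ratio of medians.
    `conformers` is a hypothetical list of (energy, (I1, I2, I3))."""
    conformers = sorted(conformers, key=lambda c: c[0])[:max_conf]
    e_min = conformers[0][0]
    kept = [np.sort(m) for e, m in conformers if e - e_min <= energy_window]
    i1, i2, i3 = np.median(kept, axis=0)
    return i1 / i3, i2 / i3
```

The point of the median over the retained conformers is robustness: no single conformer has to be "the one you believe" when a human inspection step is off the table.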
I think what a lot of people in the cheminformatic field have been thinking about is three- or four-point pharmacophores: objects with pseudo-atomic features that are arranged in space. When you're dealing with a protein, it's pretty easy to think about how you're going to orient them. But when you're dealing with a library, you then have the orientation problem: you've got compounds that have one or more of these features, and before you can start computing a similarity on that across all compounds, you have to decide how they should be oriented. [Audience question inaudible.] So it has been done both through-space and through-bond. So yes, sure.