What was the last five so we had it was our distinguished lecture was a joy to my patients you know and the great life. So I'm one of those that go. The center of my life. You know that little pleasure and the order in the years I would just be so rich or nice as my life was ordered by the girls in my service or a symbol of our own British life. They should rather sharp and the better. THIS HOUR. Let me just tell you that you can be present by several just by this there is just let the conviction of your cross the threshold but just let the process of your life and there's a lot of it is also likely. You're on generic cause that doesn't go by this remarkable discovery the subject of last night and I really wasn't that little bit by Michael Jackson this out of the end he looked like you better go make a fresh original This is originally from work. Years ago. Yes you live in. You know just like you know any other day styles from Asian problem with George. They're the most prominent right. And that's what the issue of us got it from the loss of your shirt from his famous fantasy life and six years from both I did read it and greatest life and this is my sixth day just don't read it. Five years old at that point and I can see the change is this sort of role model but I always get there and you know you're standing right over your head of your city. Yes And then like that so was it just kind of let what's there. What's left of those five years of good life. You know I don't like this discovery was made possible. You're the liar. You said that I want to go by your methods of monies and by night when I go with you with my lips and was just as it was just like you said the people of biologists while they're very look to see what a science has been top of it was against somebody just of a small measure to the support of the leadership of. 
This is where you can actually hear this and it a good idea just to be Hannibal Lechter for that matter you should know better than the science and what made us the robots at the scene. Yes it was a bit of a guest. This is a fight with your own. Well first of all. Michael published a number of papers last year. Always disliked it just a couple but it's notices that he says that many of us adults. One of the biggest but he was elected the number of years. One was just in the south of that over was knighted by the News of the second episode that made it possible just standing there and this on among many others while it's not immediately. Just bad but is it all rich is it good music lovers. And now he's one of the you can easily get together with many social musicians Stuart Andrew Lloyd Webber sort of I'm joined by actually good at it as well as Sir Paul McCartney This is all my thing and not. Richard Roth because it's enormous on the sleeve and the play
Well, thank you very much. It's always very embarrassing to hear these introductions, and you shouldn't believe too much of it. What I'm going to try to do today is to give two talks in one, and the reason for that is that we have a new initiative under way at the moment and I really want to solicit some of your help with it. But I also want to tell you about restriction enzymes, bioinformatics and genomics, things that have been very close to my heart for many, many years. So I'll start off by going quickly through the genomics of restriction and modification and then move on to this new project, which is called COMBREX and which is all about annotating genomes. So let me just remind you of what the phenomenon of restriction and modification is. Bacteria have constant problems because they're always surrounded by bacteriophages, viruses that come and want to take them over if they possibly can.
And so bacteria have come up with a kind of immune system that helps them prevent phages from infecting them. They do this by encoding an enzyme called a restriction enzyme, represented by this little gremlin here with scissors. The idea is that this enzyme will cut every time it sees a particular sequence within the DNA; in this case GAATTC is the recognition sequence for EcoRI, one of the well-known enzymes. As the phage is injecting its DNA into the cell, the restriction enzyme will cut it up into pieces and so stop the infection. Now of course this could be a problem for the bacterium if it didn't have some way of protecting its own DNA against the action of the restriction enzyme, and so it has another enzyme, represented by this gremlin, which modifies the specific sequence that is recognized by the EcoRI restriction enzyme. It puts a methyl group on the second adenine residue, and in so doing protects the host DNA, the bacterial DNA, against the action of the restriction enzyme. The bottom line is that you end up with a competition when new phage DNA is entering the cell, between the modification enzyme, which is in the cell and has been keeping the host DNA protected, and the restriction enzyme, which is looking for some unmethylated DNA to cut. Nature has usually stacked the deck: the methylase is not a very good enzyme, it's just not very fast, whereas the restriction enzymes tend to be very, very fast at cutting. So most of the time the restriction enzyme gets to the site before it can be methylated, cuts it up, and the cell is protected. This is very good for a number of reasons, but most importantly it's very good because in the laboratory these restriction enzymes work very quickly. They're very efficient. They work for us.
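To make the restriction versus modification picture concrete, here is a minimal Python sketch. It assumes the EcoRI site GAATTC with cleavage after the first G, and it represents methylation simply as a set of protected site positions. The function name `digest` and the toy sequences are hypothetical, chosen only for illustration.

```python
SITE = "GAATTC"      # EcoRI recognition sequence (reads the same on both strands)
CUT_OFFSET = 1       # EcoRI cleaves after the first G: G^AATTC


def digest(dna, methylated_sites=frozenset()):
    """Cut dna at every unprotected SITE and return the fragments.

    methylated_sites holds the start positions of sites that the
    methylase has already modified; those sites are left uncut.
    """
    fragments = []
    start = 0
    pos = dna.find(SITE)
    while pos != -1:
        if pos not in methylated_sites:
            cut = pos + CUT_OFFSET
            fragments.append(dna[start:cut])
            start = cut
        pos = dna.find(SITE, pos + 1)
    fragments.append(dna[start:])
    return fragments


# Toy sequences (hypothetical): incoming phage DNA is unmethylated,
# while the host's single site at position 2 is methylated.
phage_dna = "TTGAATTCCGGAATTCAA"
host_dna = "ACGAATTCGT"

phage_fragments = digest(phage_dna)
host_fragments = digest(host_dna, methylated_sites={2})
```

In this toy run the phage DNA is cut at both sites while the host DNA stays in one piece, which is the outcome the two gremlins are meant to illustrate. The kinetic race between methylase and restriction enzyme is not modeled here.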
Typically you can get complete digestions in a few minutes with them, and this makes it very easy to do biochemistry with them. A lot of these things are now known, and I'll tell you about some of them. I just want to give a plug to my database called REBASE; if you google REBASE you can easily find it. It is basically a database of everything that you might want to know about restriction enzymes. It's actually one of the very earliest of the molecular biology databases: it was started in 1975 and has been continuously maintained since then. Now let's talk about the restriction-modification systems. Because the enzymes have been so useful in the molecular biology lab and for the whole genetic engineering industry, there has been over the years a tremendous amount of screening for new enzymes. This screening is actually very easy to carry out: all you do is grow a bacterium, make a crude extract, throw in some DNA and then ask, can the extract cut the DNA into nice discrete fragments that you can display on an agarose gel? It is very easy to find new restriction enzymes this way. As a result, most of the enzymes shown up here, the Type IIs, have actually been isolated and shown to be functional. There are more than thirty-eight hundred of these enzymes that have been discovered during the course of screening, both in academic labs and in commercial companies including New England Biolabs. So a very large number of these are now known and have been characterized. In fact, among the enzymes that have been characterized biochemically this is probably the largest single family: many more of these have been found than any other enzyme you can think about, and they've actually been characterized in the laboratory in terms of their recognition sequences.
But we now know that there are actually a lot of different types of restriction enzymes. We have four major types. Type I enzymes recognize a specific sequence but cut randomly, away from that sequence. They have some quite interesting biophysical properties, but they turn out not to be useful as reagents to cut up DNA because of this random cleavage. They are three-subunit enzymes, and I don't propose to go into them in any detail other than to say that we have evidence for ninety-four functional enzymes, that is, either genetic or biochemical evidence showing that they're active. The Type IIs are the ones that I just talked about; here you have two completely separate genes, one coding for the methylase and one coding for the restriction enzyme. Type III enzymes also have two genes, but in order to get restriction you need a combination of the two subunits, the Res and the Mod subunits, and just fourteen of these have been shown to be functional. And finally there's a class here called Type IV, and these enzymes recognize methylated DNA. Now you have to realize that bacteria that encode restriction enzymes are busy fighting phages, and so of course the phages fight back. What would you do if you were a phage? Well, they figured, let's modify our DNA so the restriction enzymes won't recognize us anymore. And so in fact there are quite a lot of phages that modify their DNA by methylation or in other ways, and in this way they overcome restriction barriers. As a result the bacteria then figured, well, we've got to do something about these, and so they developed enzymes that specifically recognize modified DNA. So there is now a whole bunch of enzymes that we know recognize methylated DNA, and there is a little bit of a nomenclature issue here, because some of these enzymes have been classified as Type II and some as Type IV, and that's something that I'm going to be doing something about over the course of the next year.
So the nomenclature becomes a little more consistent. Now you'll notice over here I have a column called "putative". It turns out that rather a large number of these enzymes we've cloned and sequenced, and as a result of those sequences we can now go into GenBank and into the sequenced microbial genomes and find examples of genes that we expect to be coding for restriction enzymes. In the case of the Type I systems, some two thousand putative systems are found among sequences that are in GenBank; this is among about twelve or thirteen hundred prokaryotic sequences in GenBank, plus other sequences that have gone in for other reasons. In the case of the Type II enzymes, there are some thirty-six hundred that show some similarity to Type II restriction enzymes, or that we can guess on the basis of structure prediction should be restriction enzymes. But there are more than eight thousand DNA methylase genes that we can find, and I'll tell you why there is this discrepancy in the numbers in a little while. You can see this is a pretty large number, and it greatly exceeds the numbers that have been functionally characterized. This is an important problem that I will get back to in the second part of the talk. In this case we have six hundred Type IIIs that we can recognize, and here we've got some fourteen hundred Type IVs that we can recognize. But it's really the Type IIs that have consumed most interest, because these are the ones that make good reagents for biochemistry, and you can imagine that with more than three thousand representatives they come in a variety of flavors. There are some that are just the simple two genes, a restriction enzyme gene and a methylase gene. Some have an additional gene with them, which we call a controller gene, and this C
gene controls the expression of the restriction enzyme gene. If the whole system is going to move from one bacterium to another, as these systems often do, you can imagine that if you move both genes into a naive host and express the methylase and the restriction enzyme at the same time, then because the new host will typically have thousands of sites for the restriction enzyme, you just can't protect it in time. So what this controller gene does is hold in check the expression of the restriction enzyme gene until such time as the new host is completely methylated. I could tell you a lot more about that, but that would be a whole seminar in itself; it's a very interesting set of genes. Now, some enzymes recognize asymmetric sequences. If you notice, these two sequences are symmetric: if you write the sequence five prime to three prime on one strand and five prime to three prime on the other, it's exactly the same. That means a methylase that recognizes, say, this A residue will also recognize the A residue on the other strand. But where you have an asymmetric recognition sequence, you have to protect both strands to protect the DNA after replication, and in this case you find two genes, two methylase genes, next to one another. Occasionally these are fused. This becomes very nice when you're trying to analyze a genome sequence, because you can tell when you're looking at an enzyme that is likely to recognize an asymmetric sequence: it comes with two methylase genes. Sometimes the restriction enzyme gene comes in two parts, as in this case here, and again one of these subunits will cut one strand of the sequence and one will cut the other strand. Here is an example of a system that inherently looks like a Type I system. It's got M, S and R subunits; it's also got one of these controlling sequences. But it recognizes a symmetric sequence. One of the more interesting classes is one of these R-M
systems in which both restriction and modification are encoded in exactly the same gene. This is what we call the MmeI family. There are something like two to three hundred of these that we know about at the moment, and they're particularly interesting because they're one group of restriction enzymes where not only have we been able to work out how the recognition takes place and then alter it by genetic engineering, so that we can now make a large number of enzymes that recognize sequences of this general asymmetric structure, but we've also been able to look at new sequences and make a sensible guess about what the recognition sequence is going to be, and that turns out to be useful too. Finally, down here we have a case of an enzyme, HpaII, that has this extra gene. It's called a V gene, and it is part of a repair system. It turns out that this methyltransferase forms 5-methyl-C, and 5-methyl-C is inherently mutagenic. The problem is that 5-methyl-C deaminates very easily and makes T, and if you didn't do something about that, then after replication you would get a mutation. It turns out that this V gene encodes a mismatch-recognizing enzyme that recognizes when you have a T-G mismatch specifically within the context of this recognition sequence. It will cut the T-containing strand and thereby open up the mismatch in order to get it repaired. So this restriction system, realizing that it's inherently causing a problem, brings along some repair machinery to go with it. Now there are three kinds of methylation that we know these enzymes use to protect DNA against restriction enzymes: 5-methylcytosine, which I just mentioned, in which the methyl group is here at the 5 position on the aromatic ring, and then two with external methyl groups, N4-methylcytosine and N6-methyladenine, in which the methyl group is on the exocyclic amino group up here. These turn out to be very easy to make; it's an easy chemical reaction to put a methyl group onto an external amine, whereas this one is a very difficult thing to carry out. These three groups all have something in common: if you look at the enzymes that produce them, they all have some similarities in terms of sequence. But it turns out that the enzymes that make 5-methyl-C are very easy to recognize by bioinformatics, whereas the ones that make these two are a little more difficult, and in particular we still don't know how to differentiate, just by looking at sequences, whether an enzyme is going to be an N4-methyl-C enzyme or an N6-methyl-A enzyme.
So I want now to talk to you about how we go about analyzing genomes in order to find new restriction systems. The M genes are really easy to find because they have good sequence motifs within them. You can think about this in terms of the fact that these enzymes do two things. First of all they recognize DNA, but they recognize a lot of different DNA sequences, and so they have regions within them that are very variable because they have to do with sequence recognition. But they also have a section of the protein that is necessary to do the chemistry, and those sections are conserved. That makes them pretty easy to find. Now the S genes, the specificity subunits of the Type I enzymes, and the V genes, the genes that do the mismatch repair, are also pretty easy to find, and so when you see one of these you can think, well yes, this must be part of a restriction system. The C genes, the genes that control expression of restriction enzymes: some are quite easy, because there's one big family of these things that we've located, but some are quite difficult. There are a few that look as though they've evolved in some separate way.
And finally the things we're really interested in, the R genes, the restriction enzyme genes, are very difficult to find. The reason they're difficult to find (sorry, I skipped that slide) is that they're evolving very rapidly. It turns out that even when you have two restriction enzyme genes that recognize exactly the same sequence, very often they show very limited sequence similarity, in some cases to the point where you would never guess that they were actually recognizing the same sequence. This is because they're part of a host-parasite interaction, constantly being challenged by phages, and also they don't have any direct selection pressure on them except when there is a challenge by a phage. If you happen to be in an organism with many different restriction systems, and I'll show you an example of that in a moment, then provided one of the restriction systems is active and can dispose of the phage coming in, there is no pressure on the other systems to maintain their activity. This is reflected by the fact that they're changing extremely rapidly. Now if we go through and look at the sequenced microbial genomes, as of yesterday there were twelve hundred and eighty-five bacterial genomes for which we have complete sequences, and I mean closed sequences, not just shotgun data. We have ninety-three archaeal sequences. And among all of these, eleven hundred fifty-five have at least one restriction system in them. The ones that don't tend to be bacterial endosymbionts, things that live inside host cells; there are many examples of this. They apparently don't face problems from phages, or if they do they find other ways of dealing with them. Now a lot of what we do (sorry, I went the wrong way) is to look at the newly sequenced genomes and see what we can find by way of restriction systems.
A number of years ago Helicobacter pylori was sequenced, and it was discovered that in this organism there are no fewer than twenty-six different restriction systems present. For some reason this organism is just taking up restriction systems. What I show here is not all twenty-six but just the Type IIs, because these are ones for which we actually cloned every single example and checked for activity. What we found was that among these systems there were six that had active restriction enzymes associated with them, but there were eleven that had active methylase genes associated with them. The ones that are active are the ones labeled HpyAV, HpyAIII and so on; they have a P after them if they're putative, that is, if they were just identified by bioinformatics. There are now a lot of different Helicobacter pylori strains that have been sequenced, and so far no one has found two strains that have exactly the same complement of restriction enzymes and methyltransferases being expressed. So it may be that in some way these provide an epigenetic mark, in the case of bacteria, to somehow identify the species and prevent cross-breeding of any sort. This is rather an extreme example; it's not the most extreme, but there are a lot of organisms that have four, five or six restriction systems in them. So this gets to be quite interesting. It also makes life a little more complicated, and it also explains why these enzymes are evolving so rapidly, because in a case like this, if this one enzyme is sufficient to block entry by the phage, then there's no selective pressure on the others. It's only when a phage comes in that is modified against all but one of the systems that you then have to worry about that system being active.
It also turns out that a lot of these systems that have just an active methylase gene have right next to it a restriction enzyme gene that is one mutation away from being active, and in two cases we've shown directly that you can reactivate it by a single mutation. OK, so there's the slide that was missing; I just put it in the wrong order. I've already talked about that, so we can press on. Now what I want to do is to show you the results of an idea that I had while I was taking a shower one day. One of the things that I think is kind of fun about bioinformatics is that there are all sorts of ways in which it can do things that are a lot more interesting than just looking at sequence comparisons. It occurred to me one day, as I said while taking a shower, to ask what happens when you do a shotgun sequencing experiment. You take a bacterial genome and do shotgun sequencing of everything. Typically you clone two to three kilobase fragments and take five-hundred-base-pair reads from each end. So the question is, what happens to fragments that contain intact non-clonable genes such as restriction enzyme genes? Let's say one of these 2 kb pieces has a restriction enzyme gene in it. Well, when that goes into E. coli and is expressed, it will kill the E. coli. And so any such genes should be missing from the set of clones that define the shotgun. That's illustrated on the next slide, where you can see that if you've got a clone that runs up here and just goes a little way into the restriction enzyme gene, then it should be perfectly OK; there's no reason to think it's going to have a problem. But as soon as you start to take clones that start over here, run into the restriction enzyme gene and contain the complete gene, they're going to be dead and they're going to be missing from the shotgun sequence set.
And eventually you'll get to a point where perhaps you can clone something starting from the middle of the restriction enzyme gene, and now it will proceed perfectly well, and of course all the ones down here will be OK too. So I thought what we could do is look at all the shotgun sequence data and see where there were gaps in the coverage. In principle those gaps should correspond to lethal genes, in particular in this case the restriction enzyme genes, and they should tell us when restriction systems were active. Fortunately there was plenty of data for this. Haemophilus influenzae was the very first genome ever sequenced; it was done by Craig Venter. They had the original data; they had to go down in the basement and gather it, but they got me the original data and we started looking. What I've done here is just to plot the position of the five-prime end of each read, going from left to right. These are the two genes down here: here's the restriction enzyme gene, here's the methylase gene. What you can see is that upstream of the restriction enzyme gene there is a big gap, because any clones starting in that area would have gone straight through the restriction enzyme gene, would have had it intact, and would be dead. If we do the same thing on the other strand, we see that there is also a gap upstream of the restriction enzyme gene, except for this one guy here. When we looked at what was in that particular read, it turned out it was a chimeric clone that had a little bit of sequence from here, but the rest of the sequence was from elsewhere in the genome. It was just that when they did the original shotgun, two pieces got fused together. So we looked at a large number of genomes where the data was available and where we knew what was active and what wasn't, and it turned out this was a very nice way of finding active genes.
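The coverage-gap idea can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions that are not from the talk: every clone has the same length, clones start at evenly spaced positions, only forward-strand read starts are considered, and any clone containing the complete gene is lost. The function names and the coordinates are hypothetical.

```python
def surviving_read_starts(genome_len, clone_len, gene, step=100):
    """Return the 5' start positions of clones that survive cloning.

    gene is a (start, end) pair for a lethal gene. A clone is assumed
    to be lost when it contains the complete gene.
    """
    gene_start, gene_end = gene
    starts = []
    for s in range(0, genome_len - clone_len + 1, step):
        contains_whole_gene = s <= gene_start and s + clone_len >= gene_end
        if not contains_whole_gene:
            starts.append(s)
    return starts


def largest_gap(starts):
    """Return the pair of neighbouring read starts that are furthest apart."""
    return max(zip(starts, starts[1:]), key=lambda pair: pair[1] - pair[0])


# Hypothetical numbers: a 10 kb region, 2 kb clones, and a 1 kb
# restriction enzyme gene at positions 5000 to 6000.
starts = surviving_read_starts(10_000, 2_000, gene=(5_000, 6_000))
gap = largest_gap(starts)
```

With these toy numbers the only large gap in read starts runs from just upstream of the gene to just inside it, which is the pattern described for the plotted five-prime ends. Real data would also need the reverse strand, variable insert sizes and a way to discount chimeric clones.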
So the next thing we did was to go to a strain where we'd never looked before, where no one had ever done the biochemistry. We looked at a Methylococcus strain, and by bioinformatics we could see there were three nice genes up here: there's a methylase gene; this one, at the time, was just an open reading frame that didn't look like anything else in GenBank; and then down here was one of these V genes, the thing that is responsible for mismatch repair. When we look at the reads, you can see down here there is a nice gap, going from left to right, upstream of the restriction enzyme gene, and on the other strand there is a nice gap upstream of the restriction enzyme gene in the other direction. So we were pretty certain this was going to be an active restriction enzyme gene. We cloned it and we tested it. We actually tested it by in vitro transcription-translation: we just amplified a little bit of DNA that contained the gene, put it into an in vitro transcription-translation system, added some lambda DNA and ran it out on a gel, and you can see here you get a nice banding pattern. This one is the result of the in vitro transcription-translation of the Methylococcus gene. We recognized this pattern, actually, and so over here is the pattern you get from an enzyme called BssHII, a known restriction enzyme. In the middle is what happens if you mix the two, just to show that these two banding patterns really are the same; if they were different, you would get extra bands in the middle. When we looked at the site of cleavage we discovered that whereas BssHII leaves a four-base five-prime overhang, this one leaves a two-base three-prime overhang.
So the two enzymes have no sequence similarity; they don't look anything like one another. In fact we found a number of other new restriction enzymes using exactly the same technique. I thought this was kind of a fun way of using bioinformatics to do things that were not immediately obvious, and yet you actually get functional information out of this kind of data. Now unfortunately everybody's discovered high-throughput sequencing and nobody is cloning any more, and so the amount of data available for this is gone. So we can't keep doing this, but for a while it was pretty good. Now what I want to do is to talk to you about a new project that is just getting started. It's called COMBREX, computational bridges to experimentation, and it builds on something that has been a focus in my lab for a long time. I like to do bioinformatics, but I like to make predictions and then test them; I like to link the bioinformatics to the biology, and I have an experimental lab in addition to several bioinformaticians who work with me. Back in 2004 I wrote a paper, a little commentary that was published in PLoS Biology, and it was basically saying that we have a problem. At the time there was a lot of sequencing taking place; we were gathering DNA sequences galore. What we were finding was that there were hundreds and hundreds of genes that you could identify using Mark's programs and other people's programs. You could clearly see, here is a nice coding region. We have absolutely no idea what those coding regions do.
And so my proposal, because there are no high-throughput methods that we know of to do this, was that we use a pseudo-high-throughput method and achieve high throughput by parallelization. The idea is that you get all the bioinformaticians you can who want to collaborate on this to make predictions and put them into a big database. So we have a great big database of predictions, and then we recruit biochemists to come along and test some of those predictions: selected ones, things that are going to give you a lot of value in terms of later being able to do bioinformatics or later being able to understand the biology of the organism. In this way we would build a set of genes for which we had well-characterized experimental functions. For instance, among the genomes that we know at the moment there are plenty of examples of genes that occur in five hundred or six hundred of these organisms, and it's clearly the same gene. We have no idea what they do. Absolutely unknown function. So this is a way of trying to find out what the function is, to expand our knowledge so that perhaps one day we really can do systems biology because we know what all the gene products are doing. As I say, this was proposed in 2004. I got a very positive response from NIH; they said, let's organize a little conference, talk about this and see how we can make progress. So I did that. In late 2004 I organized a conference down in Washington; all of the key people from NIH came to this conference, all the people associated with the funding agencies. But the problem was that what I wanted to do was to match up expert biochemists with particular predictions that lay right in their area of expertise, and in such a lab it doesn't actually take very much extra money to support a student to come in.
It could be a rotation student, could be a graduate student, could be an undergraduate or in many cases even a high school student, to come in and just make a little bit of the gene product, put it through the assays that are well known in the lab, and find out whether this prediction is correct or not. Because at the moment what happens is that loads of people are making predictions and no one is testing them. Part of this is that biochemists don't particularly like to test other people's predictions. So the idea was, we'll give a small grant to the lab. We'll give them five or ten thousand dollars just in order to test this particular prediction. That was the essence of the proposal. Well, nothing happened until late in 2008. NIH, in their wisdom, said, well, you know, you're not asking for enough money. We don't know how to give away five thousand dollars; we know how to give away two hundred fifty thousand dollars, but we don't know how to manage a portfolio as small as this. And so nothing happened. I couldn't get them to do anything. Then in 2008 an RFP, a request for proposals, finally came out tackling exactly this problem: how are we going to do annotation better? Because even NIH finally realized, in light of the numbers you see here. This is the growth in the number of microbial genomes coming out; this is 2004, here, when I originally wrote my proposal. They're getting worried. They're spending all this money on sequencing, but the functional annotation of those sequences is lagging way, way behind. You could at this point, I think, make a strong case: what is the point in sequencing any more genomes if you're not going to try and work out what the DNA is doing? Why bother gathering more DNA sequence? Well, you know, obviously DNA sequencers don't find that a very popular idea.
And so they felt they had to do something about this functional annotation problem. So they put out this request for proposals, and we applied to it as a group. When I say we, I'm talking about Boston University, where I have an adjunct position, and myself. We wrote a proposal that essentially fleshed out this original idea that I'd had in the PLoS Biology article. I just want to illustrate the problem, and it really is significant. Haemophilus influenzae was the very first genome ever sequenced; this was done back in 1995. If we look at the content of the genes, you discover that of the 1,657 genes that have been identified, there are 1,228 that are either known, because someone has actually done some experiment on them, or for which you can make a good prediction. Now what do I mean by a good prediction? I mean a prediction that is based upon some computer looking and saying, here's a gene that is 90 percent similar to this gene in another organism, which has been annotated in some particular way. It's a dehydrogenase, it's an oxidase, whatever it is, and the substrate is known. However, there is a problem. What happens very often with computational predictions is that you've got a gene that may or may not be what I call a gold standard gene, as you will see in a moment, one where at some point someone actually did an experiment. And you find another gene that is 90 percent similar to it. So you say this must be the same, and it gets annotated in GenBank as being the same. And at some point things go wrong. Take this gene that was originally 90 percent similar to the gold standard gene, the one where the function was determined. Somebody finds a gene that is 90 percent similar to that, and says it must have the same function too.
And so what happens is that the quality of annotation tends to go down, because almost all of the annotation is being done by computer programs, not by people, and computer programs are only as smart as you tell them to be. Very often, things that are 98 percent identical can actually have different substrates. There are many examples among the glycosyl hydrolases, things that cut up sugars. You change a couple of amino acids, and all of a sudden the sugar that it will cut next to has changed. Instead of recognizing one sugar, it recognizes xylose, and so on. Most of the computer programs that are doing all of this annotation these days don't recognize that. You know: 98 percent similar, wonderful, it must be the same. And so there is a lot of wrong annotation in GenBank. Some people's best estimate is that as much as 15 percent of all of the annotation in GenBank is wrong. It cannot be trusted. The bottom line is you really cannot trust any of it at this point, since we don't know which 15 percent that is. So even among these 1,228 genes where there are good predictions, there's a fair chance two or three hundred of these are wrong, and we don't really know what they do. Now what's left is 429 unknowns. These are genes that we have no idea what they do. They're labelled as being hypothetical proteins. Some computer program, one of Mark's programs, came along and said this is the start of the gene and here is the end of the gene, and you look, and sure enough it makes a perfectly good protein, 450 amino acids long. It looks like a genuine protein. We have no idea what it does. It could be similar to something else in twenty other genomes, and we probably don't know what it does in any of those genomes either. So that is a bit of a problem. And then there are others that are sort of conserved hypotheticals.
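As a quick check on the figures quoted above, the sketch below redoes the arithmetic. The gene counts are the ones given in the talk; applying the 15 percent estimate uniformly to this one genome is an assumption made only for illustration, and it gives a figure a little below the "two or three hundred" mentioned by the speaker.

```python
# Counts quoted in the talk for the Haemophilus influenzae genome.
TOTAL_GENES = 1657
KNOWN_OR_WELL_PREDICTED = 1228
UNKNOWN = 429

# Upper estimate quoted for the share of database annotation that is wrong.
ASSUMED_ERROR_RATE = 0.15


def percent(part: float, whole: float) -> float:
    """Return part as a percentage of whole."""
    return 100.0 * part / whole


# The two categories should account for every identified gene.
assert KNOWN_OR_WELL_PREDICTED + UNKNOWN == TOTAL_GENES

unknown_pct = percent(UNKNOWN, TOTAL_GENES)

# Assumption: the 15 percent estimate applies uniformly to this genome.
suspect_annotations = ASSUMED_ERROR_RATE * KNOWN_OR_WELL_PREDICTED

print(f"genes of unknown function: {unknown_pct:.1f}% of the genome")
print(f"annotations expected to be wrong at 15%: about {suspect_annotations:.0f}")
```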
So that's what this 26 percent is, and this number is actually very close to what it was back in 1995 when the genome was first sequenced. There's not been a lot of progress in understanding what the genes are coding for. In some of these other organisms the numbers are higher. Methanococcus jannaschii was the first archaeal genome done: 35 percent, no idea what the genes do. One of the more recent ones is Pseudomonas aeruginosa, where for 42 percent of the genes, almost half of the whole organism's genome, we do not know what they are doing. And yet for many of these genomes people are trying to do what I call systems biology. They're trying to work out computer programs that will tell you how these genes are working, how the organism is working. I think it's not doable. At the moment it is just simply not doable. But there's no reason to think that it couldn't be doable, and I think we should focus on at least a few of these organisms and work out exactly what all the genes are doing. That means doing two things. It means first of all working out biochemically what the gene products are capable of, and then ideally, genetically, putting that within the context of the working of the whole organism. But the thing that is easy to do is the biochemistry. Often all the genetics ever does is give you a hint as to what's going on, maybe give you a prediction as to what this gene is doing. It's not often that the genetics absolutely nails the function. Biochemistry doesn't nail the in vivo function; it nails the in vitro function. It doesn't tell you what pathways the protein may be participating in, and for that you ultimately will need to do the genetics. But very often just knowing the biochemical function can take you an awful long way towards understanding its role within the cell. Sorry, wrong way again. So here's the cause of the problem. You hear a lot about how great it is to be able to sequence DNA for less and less and less.
This is the cost per base pair, the number of base pairs you get per dollar, and you can see what's happening: the number of base pairs you get per dollar is just going up and up and up. It's soon going to be right up to the sky. It's amazing how cheap it is now to get DNA sequence, and so of course, since it's simple to do, everybody does it. But I ask myself, and I ask you, what is the point in gathering all of this sequence if we're not going to try to work out what it's doing, if we're not going to try to understand its biological meaning? So why has there been so little progress? Well, there's a bunch of interesting reasons. One is the inherent difficulty of the problem. If I give you a protein 350 amino acids long and I say, hey, I'll give you a hundred thousand dollars, work out what it's doing, you can't do it. All you can do is assay for everything that you think it might be doing. If you're lucky you'll hit it. The odds are you're not going to hit it. So we definitely need to find ways of making reasonable predictions about what it's doing. Does it look as though it might be interacting with S-adenosylmethionine? Does it look as though it's going to need ATP? Does it have motifs in it that suggest it might be involved in this or that? This is where the bioinformaticians can come in, and there are many ways, in addition to just looking for motifs or just looking for sequence similarity, that you can make predictions. I showed you one of them in the case of the restriction enzyme genes, where we identified genes that were lethal from data that was gathered in some completely other way. And there are many other ways. You can perhaps identify operons and say, here's a bunch of genes, five genes that are clearly all involved in heme biosynthesis, but there's one open reading frame right in the middle of this operon and we don't know what it does.
Can we perhaps make a good guess as to what it might be doing? We could guess it's involved in heme biosynthesis in some way. And so maybe there's a biochemical step involved for which we don't have a gene, and you can go and test it. So there are many ways in which you can do this, but it typically takes a combination of biochemistry, maybe some genetics, maybe some computation, to begin to get a handle on this problem, and not a lot of people have all of the skills. This cross-disciplinary set of skills that's needed to really make progress does not reside in very many labs. A lot of labs just focus on computation, others just focus on biochemistry, and a lot of them never talk to one another. In fact a lot of biochemists don't want to talk to the bioinformaticians, because they think they make way too many predictions. The other thing that's happened recently is this tremendous appeal of genome-wide studies. Everyone wants to do something where you look at the whole genome all at once. You know, let's not worry about these individual things, let's look at everything. In fact it's becoming increasingly difficult to get grants in areas where genome-wide studies are popular if all you want to do is look at a few genes. Study sections say, why don't you look at the whole genome? Why don't you do microarrays? Why don't you do this or that? Well, unfortunately there are no genome-wide ways, other than computationally, of getting at this problem of function. And then finally there's a lack of appropriate funding mechanisms. You cannot go to NIH and say, hey, I found an interesting gene here, it's present in five hundred genomes, I want to work out what it does. No one will give you money for that. You simply can't get the money to do that. So this is where COMBREX comes in: throughput by parallelization. We get the bioinformaticians on board to make high-quality predictions, things that really say something specific:
I think this is an oxidase and I think the substrate is likely to be a sugar, or I think the substrate is likely to be a ketone, and I think it's likely to be an aliphatic ketone, an aliphatic substance of some description. So basically we want to get some high-quality predictions, and we want to assemble a database of those predictions. We're in the process of doing that at the moment. Biochemists can then come and test the predictions, and initially we will approach biochemists when we have a good set of predictions. On the web site we're assembling a list of what we think are the top one hundred predictions, things where, if these predictions can be tested and shown to be accurate, that will be very helpful to the bioinformaticians in making better predictions later on. That is an important criterion here, because you write an algorithm and you make predictions. If all your predictions are correct, you think your algorithm is great, but if half of them are wrong, then you've got to go back and readjust your algorithm so that you get the right answer more often. And the idea is that by getting biochemists and having them work in their field of expertise, the incremental cost of doing this work is really very, very slight. Even though it's quite slight, you still have to persuade them to participate, so we offer them some financial incentive to do this, five or ten thousand dollars. The way one can think about this going is that maybe there's a student here at Georgia Tech, maybe one of you in the audience, who says, I wouldn't mind doing this as a rotation during my first year as a graduate student. So I will go and look in COMBREX and see where there is an interesting gene, something we don't know the function for but where there's a very good prediction. You look, you find it, and you say, oh, there's a professor at Georgia Tech who is actually working in this area.
And so between us we can apply to COMBREX for a grant, perhaps get five or ten thousand dollars, so that I can go and test the prediction in this person's lab. This is very good for the lab, which gets a little extra money. It's very good for the student, because if in fact they can validate that prediction, they can publish that result. It will be a publishable result, and in the process you will help both the bioinformatics and the biochemical communities. So you do a very valuable service for biology as a whole. And the nice thing about this is that there are so many of these genes, thousands upon thousands of them, that no one has to do exactly the same thing. Everyone can do something different. You can all make a contribution, and it can be really very valuable. So we're hoping to get a lot of momentum within the community to do this. There are a lot of opportunities, a lot of possibilities. One question that comes up is, how can we improve the predictions? Well, I think one of the things is that we should recognize that the vast bulk of the current annotation is really based on sequence similarity to previously annotated genes. This is OK, but there are some problems with it that I will point out in a moment. We have to recognize that a lot of the annotation is wrong, for exactly the reason I gave earlier: something is 90 percent similar to something we know the function of, and then something is 90 percent similar to that, and something else is 90 percent similar to that, and before you know it you're 10 percent similar to the original gene and the chances of it being correct are zero. So we have to provide a much better basis for similarity-based annotation. That will be the subject of the next slide. And then we will provide, within the COMBREX web site, a means for reporting misannotations.
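The drift described above can be put into numbers with a small sketch. It assumes ungapped, equal-length alignments, so that one minus the identity behaves like a normalized Hamming distance and the triangle inequality gives a worst-case bound; the function names are hypothetical and this is an illustration of the argument, not a method from the talk.

```python
def worst_case_identity(per_hop_identity: float, hops: int) -> float:
    """Lower bound on identity to the original gene after `hops` transfers.

    Assumes ungapped, equal-length alignments, so (1 - identity) is a
    normalized Hamming distance and distances can at most add up per hop.
    """
    drift_per_hop = 1.0 - per_hop_identity
    return max(0.0, 1.0 - hops * drift_per_hop)


def independent_identity(per_hop_identity: float, hops: int) -> float:
    """Rough expectation if every hop changes positions independently.

    Ignores the small chance that a changed position mutates back.
    """
    return per_hop_identity ** hops


# Each annotation is transferred from a gene 90 percent identical to the
# previously annotated one, starting from an experimentally studied gene.
for hops in (1, 3, 6, 9):
    low = worst_case_identity(0.90, hops)
    typical = independent_identity(0.90, hops)
    print(f"{hops} hops: worst case {low:.2f}, independent model {typical:.2f}")
```

In the worst case, nine transfers at 90 percent identity each are enough to reach the 10 percent figure mentioned in the talk, even though every individual comparison looked convincing.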
All of this boils down to one thing: we need a gold standard set of genes and proteins. We need to have a set where we can say, we know the sequence of this protein, and we know the function of this protein because the biochemistry was carried out on this protein with this sequence. Then we have a gold standard against which we can compare every new gene that comes into the database. We can say how far away it is from the gold standard, and at some point, when it's sufficiently far away, you say, well, maybe we'd better just recheck and make sure that it still has the same function. It's 80 percent similar; maybe we should check, maybe it hasn't really got the same function. So the question is, where would you go for such a gold standard set? You might think, you know, we've got massive databases. We've got GenBank. We've got the DNA database in Japan. We've got UniProt, which is supposedly the database for proteins. We have the dedicated E. coli databases, which have huge lists of E. coli genes and functions associated with them. And you might think that these would already be gold standard databases. They're not. If you go to any of these databases and ask, can we find out for this gene where this function was determined, and what was in fact the gene sequence upon which this function was determined, you discover you can't even find that information. This is not something that these databases have considered worth tracking. What's happened in E. coli is that perhaps someone in some E. coli strain determined, ah, this is the DNA methylase of E. coli. But perhaps the function was actually tested in a strain that was a long way away from anything that's been sequenced, and in fact the gene was never sequenced at the time.
And so on. You discover that a large number of the E. coli proteins fall into this category: the sequence that is in the databases for this E. coli gene may not be the one on which the function was determined. The same is true in UniProt. And in fact I can tell you all of these databases are rather embarrassed by this fact, because when I asked them this question everybody said, no, we can't provide that. And when I proposed this idea for a gold standard database everybody said, what a great idea, we should do this. So now they're all busy cooperating with me and with COMBREX to actually assemble this data set. This is the essence of it. For each protein in the set, we're going to find a reference that describes the determination of function. We're going to make sure that the strain is well defined and noted. And we're going to make sure especially that the protein sequence is known for the very protein on which the function was determined. If we can't find this match between a protein sequence and a function, it will not go into the gold standard set. So we're putting this together. It will probably take a couple of months to get a reasonable start on this, because both NCBI and UniProt are busy working out procedures to make this easy, but we have got started. We have about a couple of thousand candidate genes, and each of these genes is being sent out to manual curators, who go through the literature and actually find out whether or not the literature supports the assignment. So we identify candidates from databases or from individuals, and we welcome anybody who wants to submit what they think should be a gold standard on the COMBREX web site.
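The admission rule just described can be sketched as a small record check. The class, its field names and the example entries are all hypothetical placeholders chosen for illustration; nothing here reflects the actual COMBREX, NCBI or UniProt data formats.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class GoldStandardCandidate:
    """One candidate entry; every field name here is hypothetical."""
    protein_name: str
    function_reference: Optional[str]  # paper describing the experiment
    assayed_strain: Optional[str]      # strain the biochemistry was done in
    sequenced_strain: Optional[str]    # strain the stored sequence comes from
    protein_sequence: Optional[str]

    def problems(self) -> List[str]:
        """List the reasons, if any, why this entry cannot be gold standard."""
        found = []
        if not self.function_reference:
            found.append("no reference for the determination of function")
        if not self.assayed_strain:
            found.append("assayed strain is not defined")
        if not self.protein_sequence:
            found.append("protein sequence is not known")
        elif self.assayed_strain != self.sequenced_strain:
            found.append("sequence is not from the strain that was assayed")
        return found

    def qualifies(self) -> bool:
        return not self.problems()


# Placeholder entries; the names, strains and sequence are made up.
good = GoldStandardCandidate("example protein A", "placeholder citation",
                             "strain X", "strain X", "MKKNRAFLKW")
mismatch = GoldStandardCandidate("example protein B", "placeholder citation",
                                 "strain Y", "strain X", "MKKNRAFLKW")

print(good.qualifies())       # True
print(mismatch.problems())    # one problem: strain mismatch
```

Keeping the reasons as a list rather than a single yes or no gives a curator something concrete to follow up on.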
UniProt will then make sure that there's a UniProt number for this. They will go through their annotation and prepare a template that in principle contains a reference to the sequence, a reference to the function, and a few other things about the strain designation and so on. These are then sent to curators. This is something we're doing as part of the COMBREX project at the moment, but ultimately it will be taken over by NCBI. What we're asking is that the curators manually check that the strain information is accurate. Sometimes people say, well, you know, we did this in E. coli. Well, that becomes an unknown strain. That's fine, but one should not therefore assume that it's the same as something that is in the sequence databases. We want to make sure that the biochemical characterization is accurate, that the description that is in UniProt really matches what was done, and then that the gene was sequenced, and hence the protein sequence is known, from exactly the strain in which the biochemical characterization was done. I just put a note here. You've got to know that the gold standard sequence for E. coli is for strain MG1655. There has not been very much characterization done on genes from that strain, so it's always important to know where the work was done, and to make sure the sequence is the same as in MG1655 before the annotation is put in as gold standard. So how can you help? Well, there are many ways in which people can help. You can send in candidate gold standard genes and proteins. You can volunteer to serve as a manual curator; you just email Rich Roberts, roberts@neb.com. And you can browse the COMBREX web site and let us know if you see genes that should be gold standard. The COMBREX web site is combrex.bu.edu; again, if you google COMBREX you will find it. Now how are you going to prioritize genes, since I told you there are hundreds of thousands of these things?
Well, the first thing we want to know is how many organisms contain this gene for which there is a prediction. The larger the number of organisms the better, and especially if it transcends kingdoms, if it's in plants, if it's in animals, then that gets to be a lot more interesting. For any gene that is present in a large number of organisms, obviously, if you can work out what it does you get a lot of bang for the buck. We're also very interested in trying to finish the functional annotation of E. coli, so if the gene is currently unknown in E. coli and there is a good prediction, that would be a high priority for testing. The same thing for Helicobacter pylori. We wanted to show that this approach could be used to annotate pathogenic genomes; obviously NIH is very interested in pathogens. Helicobacter pylori is very interesting. This is the organism that causes ulcers, but it's also one of the few bacteria absolutely known to cause cancer. It causes stomach cancer. It also causes esophageal cancer. It can be grown in the lab and it's actually relatively easy to work with. Is there a clone or purified protein available? The structural genomics initiative has actually generated many thousands of clones of interesting unknown open reading frames that are present in GenBank. The idea there was that they would say, let's make this protein, let's try to crystallize it, and if we can get a structure then perhaps we can determine the function. Well, it turns out that when you try that approach, about one in ten of all the proteins you make actually crystallize, and then when you get the structure, most of the time you haven't got a clue about function. Structure does not easily give rise to function. But all of these clones and all of these proteins are available, and we can tell you how to get them. And then there's a whole bunch of other criteria that I don't want to go into, a whole list of things that make particular proteins interesting.
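One way to picture how criteria like these could be combined is a simple additive score. Everything in the sketch below is an assumption for illustration: the field names, the weights and the use of a logarithm for the genome count are made up and are not the scheme used by the project.

```python
import math
from dataclasses import dataclass


@dataclass
class Prediction:
    """A predicted function awaiting a test; all fields are hypothetical."""
    gene_id: str
    genomes_with_gene: int   # how many sequenced organisms carry the gene
    spans_kingdoms: bool     # also found in plants or animals, for example
    unknown_in_e_coli: bool  # would help finish the E. coli annotation
    in_h_pylori: bool        # relevant to a pathogen genome
    clone_available: bool    # a clone or purified protein already exists


def priority_score(p: Prediction) -> float:
    """Combine the criteria with made-up weights (higher is more urgent)."""
    score = math.log10(1 + p.genomes_with_gene)  # damp very large counts
    score += 1.0 if p.spans_kingdoms else 0.0
    score += 1.0 if p.unknown_in_e_coli else 0.0
    score += 1.0 if p.in_h_pylori else 0.0
    score += 0.5 if p.clone_available else 0.0
    return score


candidates = [
    Prediction("gene_B", 3, False, False, False, False),
    Prediction("gene_A", 500, True, True, False, True),
    Prediction("gene_C", 40, False, False, True, False),
]

ranked = sorted(candidates, key=priority_score, reverse=True)
for p in ranked:
    print(f"{p.gene_id}: {priority_score(p):.2f}")
```

A widely conserved gene with a clone already on the shelf comes out on top, which matches the "bang for the buck" reasoning above.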
As for who can help: computational biologists and biochemists, obviously, and geneticists can help too. And finally university students and high school students. We are actually going to run an experiment with our local high school to have their students come in to New England Biolabs and do some of these tests as part of a high school science fair project. I was at the Intel Science Fair earlier this year, and I can tell you the kids who were participating there, from all over the world, are more than qualified to do these kinds of tests if they go to a decent lab and get some supervision. I think you could easily go and find high school students here who could come and work in labs at Georgia Tech. That would be straightforward. Professors, too: there are a lot of professors who retire and go emeritus, and they love doing experiments. Many of them take the opportunity to go back to the lab. I'm not looking at anybody in particular, but I think there are some opportunities here. Finally, of course, we need help from the funding agencies, and I've been talking not just to NIH but to NSF, to Howard Hughes, and to the Wellcome Trust in England. A lot of them have expressed interest, because this can actually be a wonderful teaching tool. There are many ways in which you could build this into undergraduate education, and many ways in which you can run summer courses in which you get students to participate in something like this, and a lot of the time you don't need a lot of funding to get it going. We're hoping to make this into a worldwide effort. It's an ideal project for developing countries, where they don't have a lot of money for science but often there is a lab that has a lot of expertise in an area and can do this and can use it to train students. So I think there are a lot of possibilities for people to get involved in this. And of course you can help.
These are the people that are involved: myself; Simon Kasif, who is a computational biologist at Boston University, a guy who came out of computer science and moved into bioinformatics; Martin Steffen, who is a biochemist and is looking after a lot of the biochemical side of things; Charles DeLisi, who is an emeritus professor at BU and was the guy who first provided funds for the Human Genome Project when it got started; Daniel Segrè, a young assistant professor at BU who looks at networks, networks within bacteria and other organisms; Steven Salzberg, a well-known computational biologist at the University of Maryland; and Dennis Vitkup, another well-known bioinformatician, at Columbia. Then all of our collaborators. We already have a lot of collaborators both on the computational front and on the biochemical front. We have half a dozen projects already funded to get going on this. And of course we're very grateful for funding from the NIH. They gave us a GO grant from the stimulus money, so this money is going to run out, and we're busy working out how to get them to give us a lot more, because obviously we're just at the start of this. But I think it's a very exciting project. It's something in which we can show that we as biologists really can collaborate, because however you look at it, science is a collaborative activity. Everything we do builds upon stuff that's been done elsewhere. And I can tell you it is a lot more fun to collaborate than it is to compete. I have a long-standing collaboration with my colleague here at the front, a crystallographer. We've worked together for many, many years, and it's so much more fun to collaborate than it is to try to compete with people. I hope that all of you will at least think about this and see if there's some way in which you can help and collaborate. Thank you. Thank you very much.
[Audience question, largely inaudible.] So I think, first off, let's say you have a protein. If you have a gold standard, there are many ways in which you can say how far away it is. Obviously there's just a simple statistical test, how many amino acids are identical, but you can also look at motifs. Are key motifs conserved? In particular, if you know that within the gold standard there is a little area that forms a catalytic site or a substrate recognition site, and you can see that that is different in this new gene, then already you can begin to say, well, maybe this new protein doesn't have the same function, maybe it's different. Many of the programs that are annotating genomes at the moment don't even do the simplest thing. So I think the idea of providing the gold standard is that it will really give the annotators the opportunity to do a much better job of annotating things, because they have something to relate it all to, and that's not what is happening at the moment. [Audience question, largely inaudible.] So expression is actually a lot easier than it used to be, but there are a lot of tricks, and I think one of the key things is you really have to keep your eye out for where people have come up with good tricks. Now, one of the things is that we're focusing on bacterial and archaeal genes. We're not doing any eukaryotic genes at the moment in this particular project, partly because of expression, but partly because eukaryotic genes are so damn difficult and complicated. Even predicting where they are can be a big problem. I mean, I'm sorry I discovered this split gene thing, because it just complicated biology enormously.
But I figure if we can really get everything going with bacteria and archaea, and get a good group of people together, we can at least show that it will work, and then we'll expand it into eukaryotes later on. So that is the idea at the moment. [Audience question, largely inaudible.] So we're not looking for cellular function. All we're trying to do in this particular project is to get at the biochemical function, because if you want to go beyond that you're talking about really a huge amount more work, and not something that can be funded for a small amount of money. What we're hoping, though, is that what will come out of this is that people will find some new functions that can then lead them to write R01s to explore them and find out about the biology. [Audience question, largely inaudible.] So for protein-protein interactions, I think we would be satisfied at this point if we could identify the binding partner. So for instance, if it's a transcription factor, what is the DNA sequence that it binds to? If it's part of a multi-subunit complex, what is it a part of? Take the ribosomal proteins: they're part of a massive complex. Some of them have enzymatic function, but a lot of them don't. But just the fact that they're a part of that complex is good enough for our purposes at this stage. So we're not trying to solve the whole of biology. We're trying to do something pretty simple that will actually lead to better annotation of genomes. That's really what it's all about. [Audience question, largely inaudible.] Right. So I told you we have this class of enzymes that is represented by MmeI. This is a single-chain restriction-modification enzyme, and in this particular case we already know how to do that. We can make many changes in specificity in that particular class of enzymes. We have another class of enzymes that we're just starting to make some progress on, where we think we will be able to do the same.
That is our goal at New England Biolabs: not just to keep offering more and more enzymes, but to say, you want an enzyme with a particular recognition sequence specificity, tell us and we will make it. So this is our goal, and most of the work at New England Biolabs now is aimed towards understanding enough about specificity that we can make new ones by design. [Audience question, largely inaudible, concerning structures of restriction enzymes.] They fall into about four classes. You can classify them as four among the ones we know. We don't have a lot of structures. We have, I think, sixteen or eighteen structures at the moment, and they are certainly not fully representative of the range of restriction enzymes that we know about. In particular, the thing that I would like to see some structures for is examples of enzymes that recognize exactly the same sequence and cut it at exactly the same point, and yet show no sequence similarity. We already have one such example, although it's not quite that. EcoRI and BamHI recognize different sequences, but the folds are almost superimposable, and the only three amino acids that they have in common are the three that form the catalytic triad that does the cleavage. And that's not of any use in terms of predicting stuff. [Audience comment, largely inaudible.] Right, right. And we have tried some of that. In some cases you can make predictions about what these things are going to recognize, but in general that approach has not worked terribly well. The sequences are too different. So for instance EcoRI and BamHI, whose structures almost coincide: if you try to thread one on the other, it doesn't work. The threading algorithms don't get it.
So it's a problem. Structure prediction in general is a bit of a problem. If things are close, then threading works very well, but as you get further and further away it gets harder. I mean, lots of these have perhaps four percent identity, and then it's quite a problem. We'd be glad to send you some if you want to try. [Audience comment, largely inaudible.] But the problem is that with the catalytic triad in BamHI and EcoRI, you've got one residue here, one here and one here in the sequence, and they only come together in space. That's the way it goes. OK, well, we'll send you some to try. And there was one more question, I think. Yes. [Audience question, largely inaudible.] If I was running NIH I would do that, but I'm not. OK. Yes. [Audience question, largely inaudible.] Membrane proteins are a major problem from every conceivable perspective that you can imagine, from a structural perspective and from a functional perspective, and I think at the moment we're going to leave the membrane proteins to the experts. So for instance, some of them we know are transporters, some are involved in transporting things to and fro, and Milton Saier is a collaborator of ours.
So bacteria have come up with a kind of immune system that helps them to prevent phages from infecting them. They do this by encoding an enzyme called a restriction enzyme, represented by this little gremlin here with scissors. The idea is that this gremlin, this enzyme, will come along and cut every time it sees a particular sequence within the DNA. In this case GAATTC is the recognition sequence for EcoRI, which is one of the well-known enzymes. The idea is that as the phage is injecting its DNA into the cell, the restriction enzyme will cut it up into pieces and so will stop the infection. Now of course this could be a problem for the bacterium if it didn't have some way of protecting its own DNA
against the action of the restriction enzyme, and so it has another enzyme, represented by this gremlin, which modifies the specific sequence that is recognized by the EcoRI restriction enzyme. It puts a methyl group on the second adenosine residue, and in so doing it protects the host DNA, the bacterial DNA, against the action of the restriction enzyme. The bottom line is that you end up with a competition, when new phage DNA is entering the cell, between the modification enzyme, which is in the cell and has been keeping the host cell DNA protected, and the restriction enzyme, which is looking for unmethylated DNA to cut. Nature has usually stacked the deck: the methylase is not a very good enzyme, it's just not very fast, whereas the restriction enzymes tend to be very, very fast at cutting. So most of the time the restriction enzyme gets to the site before it can be methylated and cuts it up, and the cell is protected. This is very good for a number of reasons, but most importantly it's very good because in the laboratory these restriction enzymes work very quickly. They're very efficient. They work for us. Typically you can get complete digestion in a few minutes with them, and this makes it very easy to do biochemistry with them.
A lot of these things are now known, and I'll tell you about some of them. I just want to give a plug to my database, called REBASE. If you Google REBASE you can easily find it, and it is basically a database of everything that you might want to know about restriction enzymes. It's actually one of the very earliest of the molecular biology databases: it was started in 1975 and has been continuously maintained since then. Now let's talk about the restriction-modification systems, because the enzymes have been so useful in the molecular biology lab and for the whole genetic engineering industry.
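The cutting and protection described for EcoRI can be sketched in a few lines of Python. This is only a toy model and not anything taken from the talk or from REBASE: it treats the DNA as a single linear top strand, uses the EcoRI site GAATTC with the cut after the first G, and represents methylation simply as a set of protected site positions. The function names and the example sequence are made up for illustration.

```python
import re

ECORI_SITE = "GAATTC"  # EcoRI recognition sequence
ECORI_CUT = 1          # EcoRI cuts after the G: G^AATTC (top strand)


def find_sites(dna, site=ECORI_SITE):
    """Return the 0-based start of every occurrence of `site` in `dna`."""
    pattern = "(?=" + re.escape(site) + ")"  # lookahead, so overlaps are found
    return [m.start() for m in re.finditer(pattern, dna.upper())]


def digest(dna, site=ECORI_SITE, cut=ECORI_CUT, methylated=frozenset()):
    """Cut a linear top strand at every unprotected site.

    `methylated` holds the start positions of sites that the methylase
    has already modified; those sites are left uncut (toy assumption).
    """
    cuts = [p + cut for p in find_sites(dna, site) if p not in methylated]
    bounds = [0] + cuts + [len(dna)]
    return [dna[a:b] for a, b in zip(bounds, bounds[1:])]


# Hypothetical incoming phage DNA with two unmethylated EcoRI sites.
phage = "TTGAATTCAAGGGAATTCTT"
print(find_sites(phage))
print(digest(phage))
# The same sequence with both sites methylated is left in one piece.
print(digest(phage, methylated={2, 12}))
```

In this sketch the outcome of the race between methylase and restriction enzyme is simply supplied by the caller through `methylated`; modelling the kinetics themselves would need rate constants that the talk does not give.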
What's happened is that over the years there has been a tremendous amount of screening for new enzymes, which is actually very easy to carry out. All you do is grow a bacterium, make a crude extract, throw in some DNA and then ask: can the extract cut the DNA into nice discrete fragments? You can display these on an agarose gel. It's very easy to find new restriction enzymes, and as a result most of the enzymes shown up here, the Type II, have actually been isolated and shown to be functional. There are more than 3,800 of these enzymes that have been discovered during the course of screening, both in academic labs and in commercial companies including New England Biolabs. So a very large number of these are known, and they've been characterized. In fact, among the enzymes that have been characterized biochemically this is probably the largest single family. Many more of these have been found than any other enzyme you can think about, and they've actually been characterized in the laboratory in terms of their recognition sequences.
But we now know that there are actually a lot of different types of restriction enzymes. We have four major types. Type I enzymes will recognize a specific sequence, but they cut randomly, away from the sequence. They have some quite interesting biophysical properties, but they turn out not to be useful as reagents to cut up DNA because of this random cleavage. They are three-subunit enzymes, and I don't propose to go into them in any detail other than to say that we have evidence for 94 functional enzymes, that is, either genetic or biochemical evidence showing that they are active. The Type II are the ones that I just talked about, where you have two completely separate genes, one coding for the methylase and one coding for the restriction enzyme. Type III enzymes also have two genes, but in order to get restriction you have to have a combination of the two subunits, the Res and the Mod subunits, and just 14 of these have been shown to be functional. And finally there's a class here called Type IV, and these enzymes recognize methylated DNA.
Now you have to realize that bacteria that encode restriction enzymes are busy fighting phages, and so of course the phages fight back. What would you do if you were a phage? Well, you'd figure: let's modify our DNA so the restriction enzymes won't recognize us any more. And so in fact there are quite a lot of phages that modify their DNA by methylation or in other ways, and in this way they overcome restriction barriers. As a result the bacteria then figure, well, we've got to do something about these, and so they developed enzymes that specifically recognize modified DNA. So there is now a whole bunch of enzymes that we know recognize methylated DNA, and there is a little bit of a nomenclature issue here, because some of these enzymes have been classified as Type II and some as Type IV. That's something that I'm going to be doing something about over the course of the next year, so the nomenclature becomes a little more consistent.
Now you'll notice over here I have a column called "putative". It turns out that rather a large number of these enzymes we've cloned and sequenced, and as a result of those sequences we can now go into GenBank and into the sequenced microbial genomes and find examples of genes that we expect to be coding for restriction enzymes. In the case of the Type I systems, some 2,000 putative systems are found among sequences that are in GenBank. This is among about twelve or thirteen hundred prokaryotic sequences in GenBank, and then just other sequences that have gone in for other reasons. In the case of the Type II enzymes, there are
some 3,600 that show some similarity to Type II restriction enzymes, or that we can guess on the basis of structure prediction should be restriction enzymes. But there are more than 8,000 DNA methylase genes that we can find, and I'll tell you why there is this discrepancy in the numbers in a little while. You can see this is a pretty large number, and it greatly exceeds the numbers that have been functionally characterized. This is an important problem that I will get back to in the second part of the talk. In this case we have 600 Type IIIs that we can recognize, and here we've got some 1,400 Type IVs that we can recognize. But it's really the Type IIs that have consumed most interest, because these are the ones that make good reagents for biochemistry.
And you can imagine, with more than three thousand representatives, they come in a variety of flavors. There are some that are just the simple two genes, a restriction enzyme gene and a methylase gene. Some have an additional gene with them, which we call a controller gene, and this C gene controls the expression of the restriction enzyme gene. If the whole system is going to move from one bacterium to another, as these systems often do, you can imagine that if you move both genes into a naive host and express the methylase and the restriction enzyme together, then because the new host will typically have thousands of sites for the restriction enzyme, you just can't protect it in time. So what this controller gene does is hold in check the expression of the restriction enzyme gene until such time as the new host is completely methylated. I could tell you a lot more about that, but that would be a whole seminar in itself. It's a very interesting set of genes.
Now some enzymes recognize asymmetric sequences. If you notice, these two sequences are symmetric: if you write the sequence 5' to 3' on one strand and 5' to 3' on the other, it's exactly the same. That means a methylase that recognizes, say, this A residue will also recognise the A residue on the other strand. But in a case where you have an asymmetric recognition sequence, you have to protect both strands to protect the DNA after replication, and in this case you find two methylase genes next to one another. Occasionally these are fused. This becomes very nice when you're trying to analyze a genome sequence, because you can tell when you're looking at an enzyme that is likely to recognize an asymmetric sequence: it comes with two methylase genes. Sometimes the restriction enzyme gene comes in two parts, as in this case here, and again one of these subunits will cut one strand of the sequence and one will cut the other strand. Here is an example of a system that inherently looks like a Type I system. It's got M, S and R subunits, and it's also got one of these controlling sequences, but it recognizes a symmetric sequence.
One of the more interesting classes is one of these R-M systems in which both restriction and modification are encoded in exactly the same gene. This is what we call the MmeI family. There are something like two to three hundred of these that we know about at the moment, and they're particularly interesting because they're one group of restriction enzymes where not only have we been able to work out how the recognition takes place, and then alter it by genetic engineering so that we can now make a large number of enzymes that recognise sequences of a general asymmetric structure, but we've also been able to look at new sequences and make a sensible guess about what the recognition sequence is going to be. And that turns out to be useful too.
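The symmetry test described above, writing the sequence 5' to 3' on both strands and comparing, amounts to checking whether a site equals its own reverse complement. A minimal Python sketch follows. The function names are made up for illustration, the rule "one methylase for a symmetric site, two for an asymmetric one" is only the rule of thumb given in the talk, and GGATG is used purely as an arbitrary asymmetric sequence that does not come from the talk.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the other strand of `seq`, read 5' to 3'."""
    return seq.upper().translate(COMPLEMENT)[::-1]


def is_symmetric(site):
    """True if the site reads the same 5' to 3' on both strands."""
    return site.upper() == reverse_complement(site)


def methylases_expected(site):
    """Rule of thumb from the talk: an asymmetric site needs both
    strands protected separately, so two methylase activities."""
    return 1 if is_symmetric(site) else 2


for site in ("GAATTC", "GGATCC", "GGATG"):
    print(site, reverse_complement(site), is_symmetric(site),
          methylases_expected(site))
```

The sketch handles only the four unambiguous bases; degenerate symbols such as R or N, which appear in many real recognition sequences, would need an extended complement table.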
Finally, down here we have a case of an enzyme, HpaII, that has this extra gene. It's called a V gene, and it is part of a repair system. It turns out that this methyltransferase forms 5-methyl-C, and 5-methyl-C is inherently mutagenic, and the problem with that is that you're going to be constantly mutating. What happens is that 5-methyl-C deaminates very easily and makes T, and if you didn't do something about that, then after replication you would get a mutation. It turns out that this V gene is a mismatch-recognizing enzyme that recognizes when you have a T:G mismatch specifically within the context of this recognition sequence. It will cut the T-containing strand and thereby open up the mutant strand in order to get it repaired. So this restriction system, realizing that it's inherently causing a problem, brings along some repair machinery to go with it.
Now there are three kinds of methylation that we know about that protect DNA against restriction enzymes: 5-methylcytosine, which I just mentioned, in which the methyl group is here at the 5 position on the aromatic ring, and then two with exocyclic methyl groups, N4-methylcytosine and N6-methyladenine, in which the methyl group is on this exocyclic amine up here. These turn out to be very easy to make; it's an easy chemical reaction to carry out, to put a methyl group onto an exocyclic amine, whereas this one is a very difficult thing to carry out. These three groups all have something in common: if you look at the enzymes that produce them, they all have some similarities in terms of the sequence of the enzymes.
But it turns out that the enzymes that do this function are very easy to recognize by bioinformatics, but the ones that do this are a little more difficult, and in particular we still don't know how to differentiate, just by looking at sequences, whether an enzyme is going to be an N4-methyl-C enzyme or an N6-methyl-A enzyme.
So I want now to talk to you about how we go about analyzing genomes in order to find new restriction systems. The M genes are really easy to find because they have good sequence motifs within them. You can think about this in terms of the fact that these enzymes do two things. First of all they recognize DNA, but they recognize a lot of different DNA sequences, and so they will have regions within them that are very variable because they have to do with sequence recognition. But they also have a section of the protein that is necessary to do the chemistry, and this is conserved; it's the same. That makes them pretty easy to find. Now the S genes, the specificity subunits of the Type I enzymes, and the V genes, the genes that do the mismatch repair, are pretty easy to find, and so when you see one of these you can think, well, yes, it must be part of a restriction system. The C genes, the genes that control expression of restriction enzymes: some are quite easy, because there's one big family of these things that we've located, but some are quite difficult. There are a few that look as though they've evolved in some separate way. And finally the things we're really interested in, the R genes, the restriction enzyme genes, are very difficult to find. And the reason, sorry, I skipped that slide, the reason that they're very difficult
to find is that they're evolving very rapidly. It turns out that even when you have two restriction enzyme genes that recognise exactly the same sequence, very often they show very limited sequence similarity, in some cases to the point where you would never guess that they were actually recognising the same sequence. This is because they are part of a host-parasite interaction, constantly being challenged by phages, and also they don't have any direct selection pressure on them except when there is a challenge by a phage. If you happen to be in an organism with many different restriction systems, and I'll show you an example of that in a moment, then provided one of the restriction systems is active and can dispose of the phage coming in, there is no pressure on the other systems to maintain their activity. This is reflected by the fact that they're changing extremely rapidly.
Now if we go through and look at the sequenced microbial genomes, as of yesterday there were 1,285 bacterial genomes for which we have complete sequences, and I mean closed sequences, not just shotgun data. We have 93 archaeal sequences. And among all of them, 1,155 have at least one restriction system in them. The ones that don't tend to be bacterial endosymbionts, things that live inside host cells; there are many examples of this. They apparently don't face problems from phages, or if they do they find other ways of dealing with them.
Now a lot of what we do, sorry, I went the wrong way, a lot of what we do is to look at the newly sequenced genomes and see what we can find by way of restriction systems. A number of years ago Helicobacter pylori was sequenced, and it was discovered that in this organism there are no fewer than 26 different restriction systems present. For some reason this organism is just taking up restriction systems. What I show here is not all 26 but just the Type IIs, because these are ones in which we actually cloned every single example of these systems and checked for activity. What we found was that among these systems there were six that had active restriction enzymes associated with them, but there were eleven that had active methylase genes associated with them. The ones that are active are the ones labeled HpyAV, HpyAIII and so on. They have a P after them if they are putative, that is, if they were just identified by bioinformatics. There are now a lot of different Helicobacter pylori strains that have been sequenced, and so far no one has found two strains that have exactly the same complement of restriction enzymes and methyltransferases being expressed. So it may be that in some way these provide an epigenetic mark, in the case of bacteria, to somehow identify the species and prevent cross-breeding of any sort.
This is rather an extreme example, though not the most extreme, but there are a lot of organisms that have four, five or six restriction systems in them. So this gets to be quite interesting. It also makes life a little more complicated, and it also explains why these enzymes are evolving so rapidly, because in a case like this, if this enzyme is sufficient to block entry by the phage, then there's no selective pressure on the others. It's only when a phage comes in that is modified against all but one of the systems that you then have to worry about that system being active.
It also turns out that a lot of these systems that have just active methylase genes have right next to them a restriction enzyme gene that is one mutation away from being active, and in two cases we've shown directly that you can reactivate it by a single mutation. OK, so there's the slide that was missing. I just put it in the wrong order. I've already talked about that, so we can press on.
So now what I want to do is to show you the results of an idea that I had while I was taking a shower one day. One of the things that I think is kind of fun about bioinformatics is that there are all sorts of ways in which bioinformatics can do things that are a lot more interesting than just looking at sequence comparisons. It occurred to me one day, as I said, I was taking a shower at the time, that when you do a shotgun sequencing experiment, you clone a bacterial genome and do shotgun sequencing of everything. Typically you clone two- to three-kilobase fragments and you take 500-base-pair reads from each end. So the question is what happens to fragments that contain intact non-clonable genes, such as restriction enzyme genes. Let's say one of these 2 kb pieces had a restriction enzyme gene in it. Well, when that goes into E. coli and is expressed, it will kill the E. coli, and so any such genes should be missing from the set of clones that define the shotgun. That's illustrated on the next slide, where you can see that if you've got a clone that runs up here and just goes a little way into the restriction enzyme gene, then it should be perfectly OK; there's no reason to think it's going to have a problem. But as soon as you start to take clones that start over here, run into the restriction enzyme gene and contain the complete gene, they're going to be dead, and they're going to be missing from the shotgun sequence set.
And eventually you'll get to a point where perhaps you can clone something from the middle of the restriction enzyme gene, and now it will proceed perfectly well, and of course all the ones down here will be OK too. So I thought what we could do is look at all the shotgun sequence data and see where there were gaps in the coverage. In principle those gaps should correspond to lethal genes, and in particular in this case the restriction enzyme genes, and they should tell us when restriction systems are active. Fortunately there was plenty of data for this. Haemophilus influenzae was the very first genome ever sequenced. It was done by Craig Venter; they had the original data. They had to go down in the basement and gather it, but they got me the original data and we started looking. What I've done here is just to plot the position of the 5' end of each read, going from left to right. These are the two genes down here: here's the restriction enzyme gene, here's the methylase gene. What you can see is that upstream of the restriction enzyme gene there is a big gap, because any clone starting in that area would have gone straight through the restriction enzyme gene, would have had it intact and would be dead. If we do the same thing on the other strand, we see that there is also a gap upstream of the restriction enzyme gene, except for this one guy here. When we looked at what was in that particular read, it turned out it was a chimeric clone that had a little bit of sequence from here, but the rest of the sequence was from elsewhere in the genome. It was just that when they did the original shotgun, two pieces got fused together. So we looked at a large number of genomes where the data was available and where we knew what was active and what wasn't, and it turned out this was a very nice way of finding active genes.
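The coverage-gap argument can be turned into a small calculation. The sketch below is a simplified illustration, not the analysis actually run on the Haemophilus data: it assumes every clone has the same insert length, considers only reads whose 5' end marks the left end of the insert, and compares the number of read starts observed in the "lethal" window with what a uniform distribution of starts would predict. All names, coordinates and the read set are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Gene:
    start: int  # 0-based position of the first base
    end: int    # position one past the last base


def lethal_start_window(gene, insert_len):
    """Read-start positions whose clone would contain the whole gene.

    A clone starting at s covers [s, s + insert_len). It contains the
    intact gene when s <= gene.start and s + insert_len >= gene.end.
    Returns (lo, hi) inclusive, or None if the gene cannot fit.
    """
    lo = gene.end - insert_len
    hi = gene.start
    if lo > hi:
        return None
    return max(lo, 0), hi


def depletion(read_starts, gene, insert_len, genome_len):
    """Return (observed, expected) read starts in the lethal window.

    `expected` assumes read starts are spread uniformly over the genome.
    """
    window = lethal_start_window(gene, insert_len)
    if window is None:
        return None
    lo, hi = window
    observed = sum(lo <= s <= hi for s in read_starts)
    expected = len(read_starts) * (hi - lo + 1) / genome_len
    return observed, expected


# Hypothetical data: a read start every 100 bp, except upstream of the gene.
gene = Gene(start=5000, end=6000)
starts = [s for s in range(0, 10000, 100) if not 3500 <= s <= 5000]
print(depletion(starts, gene, insert_len=2500, genome_len=10000))
```

An observed count far below the expected one is the signal the talk describes. The single chimeric read mentioned above would show up here as a stray start inside the window, which is why a threshold on the ratio, rather than a requirement of exactly zero reads, would probably be the safer design.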
So the next thing we did was to go to a strain we had never looked at before, where no one had ever done the biochemistry. We looked at Methanococcus, in this case a Methanococcus strain, and by bioinformatics we could see there were three nice genes up here. There's a methylase gene; this, at the time, was just an open reading frame that didn't look like anything else in GenBank; and then down here was one of these V genes, the thing that is responsible for mismatch repair. When we look at the reads, you can see down here there is a nice gap going from left to right upstream of the restriction enzyme gene, and on the other strand there is a nice gap upstream of the restriction enzyme gene the other way. So we were pretty certain this was going to be an active restriction enzyme gene. So we cloned it and we tested it. We actually tested it by in vitro transcription-translation: we just amplified a little bit of DNA that contained the gene, put it into an in vitro transcription-translation system, added some lambda DNA and ran it out on a gel. You can see here you get a nice banding pattern. This one is the result of taking just this in vitro transcription-translation product from the Methanococcus gene. We recognized this pattern, actually, and so over here is the pattern you get from an enzyme called BssHII, a known restriction enzyme. In the middle is what happens if you mix the two, just to show that these two banding patterns really are the same; if they were different, you would get extra bands in the middle. When we looked at the site of cleavage, we discovered that whereas BssHII leaves a four-base 5' overhang, this one left a two-base 3' overhang.
So the two enzymes turn out to have no sequence similarity; they don't look anything like one another. And in fact we found a number of other new restriction enzymes using exactly the same technique. So I thought this was kind of a fun way of using bioinformatics to do things that were not immediately obvious, and yet you actually get functional information out of this kind of data. Now unfortunately everybody's discovered high-throughput sequencing and nobody's doing cloning any more, and so the amount of data available for this is gone. So we can't keep doing this, but for a while it was pretty good.
So now what I want to do is to talk to you about a new project that is just getting started. It's called COMBREX, Computational Bridges to Experimentation, and it sort of builds on something that has been a focus in my lab for a long time, and that is: I like to do bioinformatics, but I like to make predictions and then test them. I like to link the bioinformatics to the biology, and I have an experimental lab in addition to several bioinformaticians who work with me. Back in 2004 I wrote a paper, a little commentary, that was published in PLoS Biology, and it was basically saying that we have a problem. At the time there was a lot of sequencing taking place; we were gathering DNA sequences galore. And what we were finding was that there were hundreds and hundreds of genes that you could identify using Mark's programs and other people's programs. You could clearly see: here is a nice coding region. But we have absolutely no idea what those coding regions do.
And so my proposal, because there are no high-throughput methods that we know of to do this, was that we do a pseudo-high-throughput method: we achieve high throughput by parallelization. The idea is you get all the bioinformaticians you can who want to collaborate on this to make predictions and put them into a big database, and so we have a great big database of predictions. Then we recruit biochemists to come along and test some of those predictions, selected ones, things that are going to give you a lot of value in terms of later being able to do bioinformatics or later being able to understand the biology of the organism. In this way we would build a set of genes for which we had well-characterized experimental functions. For instance, among the genomes that we know at the moment there are plenty of examples of genes that occur in five hundred or six hundred of these organisms, and it's clearly the same gene, and we have no idea what they do. Absolutely unknown function. So this is a way of trying to find out what the function is, to expand our knowledge so that perhaps ultimately one day we really can do systems biology, because we know what all the gene products are doing.
As I say, this was proposed in 2004. I got a very positive response from NIH. They said, let's organize a little conference, talk about this and see how we can make progress. So I did that: in late 2004 I organized a conference down in Washington. All of the key people from NIH came to this conference, all the people associated with the funding agencies. But the problem was that what I wanted to do was to match up expert biochemists with particular predictions that lay right in their area of expertise. In such a lab it doesn't actually take very much extra money to support a student to come in.
It could be a rotation student, could be a graduate student, could even be a high school student in many cases, or an undergraduate, to come in and just make a little bit of the gene product, put it through the assays that are well known in the lab, and find out whether the prediction is correct or not. Because at the moment what happens is that loads of people are making predictions and no one is testing them. Part of this is that biochemists don't particularly like to test other people's predictions. So the idea was, well, what we'll do is give a small grant to the lab. We'll give them five or ten thousand dollars just in order to test this particular prediction. So that was the essence of the proposal.
Well, nothing happened until late in 2008. NIH, in their wisdom, as a result of all this said: well, you know, you're not asking for enough money. We don't know how to give away five thousand dollars; we know how to give away $250,000, but we don't know how to manage a portfolio as small as this. And so nothing happened. I couldn't get them to do anything. Then in 2008, finally, an RFP, a request for proposals, came out tackling exactly this problem: how are we going to do annotation better? Because even NIH finally realized it, in light of the numbers you see here. This is the growth in the number of microbial genomes coming out; this is 2004, here, when I originally wrote my proposal. They're getting worried: they're spending all this money on sequencing, but the functional annotation of those sequences is lagging way, way behind. You could at this point, I think, make a strong case that there's no point in sequencing any more genomes: if you're not going to try and work out what the DNA is doing, why bother gathering more DNA sequence? Well, you know, obviously DNA sequencers don't find that a very popular idea.
And so they felt they had to do something about this functional annotation problem. So they put out this request for proposals and we applied to it. When I say we, I'm talking about a group at Boston University, where I have an adjunct position, and I. We wrote a proposal that essentially fleshed out the original idea that I'd had in the PLoS Biology article.
I just want to illustrate the problem, and it really is significant. Haemophilus influenzae was the very first genome ever sequenced; this was done back in 1995. If we look at the content of the genes, you discover that of the 1,657 genes that have been identified, there are 1,228 that are either known, because someone's actually done some experiment on them, or for which you can make a good prediction. Now what do I mean by a good prediction? Let me be clear: I mean a prediction that is based upon some computer looking and saying, here's a gene that is 90 percent similar to this gene in another organism, which has been annotated in some particular way: it's a dehydrogenase, it's an oxidase, whatever it is, and the substrate is known. However, there is a problem. What happens very often with computational predictions is that you've got a gene that may or may not be what I call a gold-standard gene, as you will see in a moment, where at some point someone actually did an experiment. And you find another gene that is 90 percent similar to it, so you say this must be the same, and so it gets annotated in GenBank as being the same. And at some point things go wrong. Take this gene that was originally 90 percent similar to the gold-standard gene, the one where the function was determined: somebody finds a gene that is 90 percent similar to that, and says it must have the same function too.
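The way similarity erodes along such a chain of annotation transfers can be made concrete. The Python sketch below rests on one stated assumption: sequences are already aligned, of equal length and without gaps, so that identity is just the fraction of matching positions. Under that assumption mismatches can at worst add up from hop to hop, so the identity to the gold-standard gene that is still guaranteed after k hops at 90 percent is 1 - 0.1k. The function names and toy sequences are invented for illustration.

```python
def identity(a, b):
    """Fraction of identical positions between two pre-aligned sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    return sum(x == y for x, y in zip(a, b)) / len(a)


def guaranteed_identity(per_hop, hops):
    """Worst-case identity to the gold standard after `hops` transfers,
    each made at `per_hop` identity (mismatches simply add up)."""
    return max(0.0, 1.0 - hops * (1.0 - per_hop))


# Toy 20-residue proteins: each hop changes two further positions.
gold = "A" * 20                          # experimentally characterized
second = "CC" + gold[2:]                 # annotated from gold
third = second[:2] + "CC" + second[4:]   # annotated from second

print(identity(gold, second))   # first hop
print(identity(second, third))  # second hop
print(identity(gold, third))    # what is left relative to the gold standard
print(guaranteed_identity(0.90, 5))
```

Real annotation pipelines use alignment scores rather than raw identity, so this is only an illustration of the direction of the effect, not an estimate of its size.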
And so what happens is that the quality of annotation tends to go down, because almost all of the annotation is being done by computer programs, not by people. Computer programs are only as smart as you tell them to be, and very often things that are ninety-eight percent identical can actually have different substrates. There are many examples among the glycosyl hydrolases, the things that cut up sugars: you have just a couple of amino acids' difference and all of a sudden the sugar that it will cut next to has changed. Instead of recognizing one sugar it recognizes xylose, and so on. Most of the computer programs that are doing all of this annotation these days don't recognize that: ninety-eight percent similar, wonderful, it must be the same. And so there is a lot of wrong annotation in GenBank. Some people's best estimate is that as much as fifteen percent of all of the annotation in GenBank is wrong and cannot be trusted. The bottom line is you really cannot trust it at this point, since we don't know which fifteen percent that is. So even of these 1,228 genes where there are good predictions, there's a fair chance two or three hundred of them are wrong, so we don't really know what they do either. Now what's left is 429 unknowns. These are genes where we have no idea what they do. They're labelled as being hypothetical proteins: some computer program, one of Mark's programs, came along and said this is the start of the gene and here is the end of the gene, and you look and sure enough it makes a perfectly good protein, four hundred fifty amino acids long, that looks like a genuine protein. We have no idea what it does. It could be similar to something else in twenty other genomes; we probably don't know what it does in any of those genomes either. So that is a bit of a problem, and then there are others that are sort of conserved hypotheticals.
So that's what this twenty-six percent is, and this number is actually very close to what it was back in 1995 when the genome was first sequenced. There has not been a lot of progress in understanding what the genes are coding for. In some of these other organisms the numbers are higher. Methanococcus was the first archaeal genome done: thirty-five percent, no idea what the genes do. One of the more recent ones is Pseudomonas aeruginosa, where for forty-two percent of the genes, almost half of the whole organism's genome, we have no idea what they are doing. And yet for many of these genomes people are trying to do what I call systems biology. They're trying to work out computer programs that will tell you how these genes are working, how the organism is working. I think at the moment that is just simply not doable. But there's no reason to think that it couldn't be doable, and I think we should focus on at least a few of these organisms, working out exactly what all the genes are doing. That means doing two things. It means, first of all, working out biochemically what the genes are capable of, and then ideally, genetically, putting it within the context of the working of the whole organism. But the thing that is easy to do is the biochemistry. The genetics often only gives you a hint as to what's going on, maybe gives you a prediction as to what this gene is doing; it's not often that the genetics absolutely nails the function. Biochemistry doesn't nail the in vivo function; it nails the in vitro function. It doesn't tell you what pathways the gene may be participating in, and for that you ultimately will need to do the genetics. But very often just knowing the biochemical function can take you an awful long way towards understanding its role within the cell. Sorry, wrong way again. So here's the cause of the problem. You hear a lot about how great it is to be able to sequence DNA for less and less and less.
This is the cost per base pair, the number of base pairs you get per dollar, and you can see what's happening: the number of base pairs you get per dollar is just going up and up and up, and is soon going to be right up to the sky. It's amazing how cheap it is now to get DNA sequence. And so of course, it's simple to do, so everybody does it. But I ask myself, and I ask you, what is the point in gathering all of this sequence if we're not going to try and work out what it's doing, if we're not going to try to understand the biological meaning of it? So why has there been so little progress? Well, there's a bunch of interesting reasons. One is the inherent difficulty of the problem. If I give you a protein three hundred fifty amino acids long and I say, hey, I'll give you a hundred thousand dollars to work out what it's doing, you can't do it. All you can do is assay everything that you think it might be doing; if you're lucky you'll hit it, but the odds are you're not going to hit it. So we definitely need to find ways of making reasonable predictions about what it's doing. Does it look as though it might be interacting with S-adenosylmethionine? Does it look as though it's going to need ATP? Does it have motifs in it that suggest it might be involved in this or that? This is where the bioinformaticians can come in, and there are many ways, in addition to just looking for motifs or just looking for sequence similarity, that you can make predictions. I showed you one of them in the case of the restriction enzyme genes, where we identified genes that were lethal from data that was gathered in some completely other way. And there are many other ways. You can perhaps identify operons and say, here's a bunch of genes, five genes, that are clearly all involved in heme biosynthesis, but there's one open reading frame here right in the middle of this operon and we don't know what it does.
Can we perhaps make a good guess as to what it might be doing? We could guess it's involved in heme biosynthesis in some way. And so maybe there's a biochemical step involved for which we don't have a gene, and you can go and test it. So there are many ways in which you can do this, but it typically takes a combination of biochemistry, maybe some genetics, maybe some computation, to begin to get a handle on this problem, and not a lot of people have all of the skills. This cross-disciplinary set of skills that's needed to really make progress does not reside in very many labs. A lot of labs just focus on computation, others just focus on biochemistry, and a lot of them never talk to one another. In fact a lot of biochemists don't want to talk to the bioinformaticians because they think they're very apt to make way too many predictions. The other thing that's happened recently is that there's this tremendous appeal for genome-wide studies. Everyone wants to do something where you look at the whole genome all at once: let's not worry about these individual things, let's look at everything. In fact it's becoming increasingly difficult to get grants in areas where genome-wide studies are popular. If all you want to do is actually look at a few genes, study sections say, why don't you look at the whole genome? Why don't you do microarrays? Why don't you do this or that? Well, unfortunately there are no genome-wide ways, other than computationally, of getting at this problem of function. And then finally there's a lack of appropriate funding mechanisms. You cannot go to NIH and say, hey, I found an interesting gene here, it's present in five hundred genomes, I want to work out what it does. No one will give you money for that; you simply can't get the money to do that. So this is where COMBREX comes in: throughput by parallelization. We get the bioinformaticians on board to make high-quality predictions, things that really say,
I think this is an oxidase and I think the substrate is likely to be a sugar, or I think the substrate is likely to be a ketone and I think it's likely to be an aliphatic ketone, an aliphatic substance of some description. So basically we want to get some high-quality predictions. We want to assemble a database of those predictions, and we're in the process of doing that at the moment. Biochemists can then come and test the predictions, and initially we will approach biochemists when we have a good set of predictions. On the web site we're assembling a list of what we think are the top one hundred predictions: things where, if these predictions can be tested and shown to be accurate, that will be very helpful to the bioinformaticians in making better predictions later on. That is an important criterion here, because you write an algorithm and you make predictions; if all the predictions are correct, you think the algorithm is great, but if half of them are wrong then you've got to go back and readjust your algorithm so that you can get the right answer more often. And the idea is that by getting biochemists and having them work in their field of expertise, the incremental cost of doing this work is really very, very slight. Even though it's quite slight, you still have to persuade them to participate, so we offer them some financial incentive to do this, five or ten thousand dollars. The way one can think about this going is that maybe there's a student here at Georgia Tech, maybe one of you in the audience, who says, I wouldn't mind doing this as a rotation during my first year as a graduate student. So I will go and look in COMBREX and see where there is an interesting gene, something we don't know the function of but for which there's a very good prediction. You look, you find it, and you say, oh, there's a professor at Georgia Tech who is actually working in this area.
And so between us we can apply to COMBREX for a grant, perhaps get five or ten thousand dollars, so that I can go and test the prediction in this person's lab. This is very good for the lab, which gets a little extra money, and very good for the student, because if in fact they can validate that prediction, they can publish that result; it will be a publishable result. In the process you will help both the bioinformatics and the biochemical communities, so you do a very valuable service for biology as a whole. The nice thing about this is that there are so many of these genes, thousands upon thousands of them, that no one has to do exactly the same thing. Everyone can do something different. You can all make a contribution, and it can be really very valuable. So we're hoping to get a lot of momentum within the community to do this; there are a lot of opportunities, a lot of possibilities. So one question that comes up is, how can we improve the predictions? Well, I think one of the things is that we should recognize that the vast bulk of the current annotation is really based on sequence similarity to previously annotated genes. This is OK, but there are some problems with it that I will point out in a moment. We have to recognize that a lot of the annotation is wrong for exactly the reason I quoted earlier: something is ninety percent similar to something we know the function of, and then something is ninety percent similar to that, and something else ninety percent similar to that, and before you know it you're ten percent similar to the original gene and the chances of it being correct are zero. So we have to provide a much better basis for similarity-based annotation. That will be the subject of the next slide. And then we will provide within the COMBREX web site a means for reporting misannotations.
What all of this boils down to is that we need a gold standard set of genes and proteins. We need to have this set where we can say, we know the sequence of this protein and we know the function of this protein, because the biochemistry was carried out on this protein with this sequence. Then we have a standard, a gold standard, against which we compare every new gene that comes into the database. We can say how far away it is from the gold standard, and at some point, when it's sufficiently far away, you say, well, maybe we'd better just recheck and make sure that it still has the same function. It's eighty percent similar; maybe we should check; maybe it hasn't really got the same function. So the question is, where would you go for such a gold standard set? You might think, we've got massive databases: we've got GenBank, we've got the DNA Data Bank of Japan, we've got UniProt, which is supposedly the database for proteins. We have the E. coli databases, which have huge lists of E. coli genes and functions associated with them. And you might think that these would already be gold standard databases. They're not. If you go to any of these databases and ask, can we find out, for this gene, where this function was determined and what in fact was the gene sequence upon which this function was determined, you discover you can't even find that information. This is not something that these databases have considered worth tracking. What's happened in E. coli is that perhaps someone, in some E. coli strain, determined that this is the DNA methylase of E. coli. But perhaps the function was actually tested in a strain that was a long way away from anything that's been sequenced, and in fact the gene was never sequenced at the time.
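The comparison against a gold standard described above can be sketched as follows. This is a hedged illustration, not the project's actual method: it assumes the two protein sequences have already been aligned to equal length by some other tool (with `-` marking gaps), and the 80 percent cut-off is simply the figure mentioned in passing in the talk, used here as a hypothetical default. The function names are invented for the example.

```python
def percent_identity(aligned_a: str, aligned_b: str) -> float:
    """Percent of aligned columns in which both sequences carry the same residue.

    Both inputs must already be aligned to the same length, with '-' for gaps.
    Columns where both sequences have a gap are ignored.
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("sequences must be aligned to the same length")
    compared = 0
    matches = 0
    for res_a, res_b in zip(aligned_a.upper(), aligned_b.upper()):
        if res_a == "-" and res_b == "-":
            continue
        compared += 1
        if res_a == res_b:
            matches += 1
    if compared == 0:
        raise ValueError("no comparable columns in the alignment")
    return 100.0 * matches / compared


def needs_recheck(gold_aligned: str, candidate_aligned: str,
                  threshold_pct: float = 80.0) -> bool:
    """Flag a candidate whose identity to the gold standard is at or below the cut-off.

    The threshold is an assumption for illustration; a flagged gene is one
    whose inherited annotation should be checked rather than trusted.
    """
    return percent_identity(gold_aligned, candidate_aligned) <= threshold_pct


if __name__ == "__main__":
    gold = "MKTAYIAKQR"        # made-up sequence, not a real protein
    candidate = "MKTAYLAKHW"   # differs from gold at three positions
    print(percent_identity(gold, candidate), needs_recheck(gold, candidate))
```

The point of the design is that every new gene is measured against the experimentally characterized sequence itself, not against the most recent gene in a chain of inherited annotations.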
And so you discover that a large number of the E. coli proteins fall into this category: the sequence that is in the databases for this E. coli protein may not be the one where the function was determined. The same is true in UniProt. In fact I can tell you all of these databases are rather embarrassed by this fact, because when I asked them this question everybody said, no, we can't provide that. And when I proposed this idea for a gold standard database everybody said, what a great idea, we should get this. So now they're all busy cooperating with me and with COMBREX to actually assemble this data set. So this is the essence of it. For each protein in the set, we're going to find a reference that describes the determination of function. We're going to make sure that the strain is well defined and noted. And we're going to make sure especially that the protein sequence is known for the protein on which the function was determined. If we can't find this match between a protein sequence and a function, it will not go into the gold standard set. So we're putting this together. It will probably take a couple of months to get a reasonable start on this, because both NCBI and UniProt are busy working out procedures to make this easy, but we have got started. We have about a couple of thousand candidate genes, and each of these genes is being sent out to manual curators who will go through the literature and actually find out whether or not the literature supports the assignment. So we identify candidates from databases or from individuals, and we welcome anybody who wants to submit what they think should be a gold standard on the COMBREX web site.
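The admission criteria just listed can be sketched as a small record with a check. This is only an illustration: the field names, the `GoldStandardCandidate` type and the use of matching strain names as a stand-in for "the sequence comes from the protein on which the function was determined" are all assumptions made for the example, not the actual COMBREX, UniProt or NCBI schema. In practice the decision rests with manual curators reading the literature.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class GoldStandardCandidate:
    """Hypothetical record for a candidate gold-standard protein."""
    protein_name: str
    function: str
    reference: Optional[str]         # publication describing how the function was determined
    assayed_strain: Optional[str]    # strain in which the biochemistry was done
    sequenced_strain: Optional[str]  # strain from which the sequence was obtained
    protein_sequence: Optional[str]


def rejection_reasons(candidate: GoldStandardCandidate) -> List[str]:
    """Return every criterion the candidate fails; an empty list means it passes."""
    reasons = []
    if not candidate.reference:
        reasons.append("no reference describing the determination of function")
    if not candidate.assayed_strain:
        reasons.append("strain used for the biochemistry is not defined")
    if not candidate.protein_sequence:
        reasons.append("protein sequence is not known")
    elif candidate.assayed_strain and candidate.sequenced_strain != candidate.assayed_strain:
        reasons.append("sequence is not from the strain in which the function was determined")
    return reasons


def is_gold_standard(candidate: GoldStandardCandidate) -> bool:
    """A candidate enters the set only if it fails none of the criteria."""
    return not rejection_reasons(candidate)


if __name__ == "__main__":
    # All values below are placeholders, not real database entries.
    candidate = GoldStandardCandidate(
        protein_name="example methylase",
        function="DNA methylation",
        reference="placeholder citation",
        assayed_strain="unspecified E. coli strain",
        sequenced_strain="MG1655",
        protein_sequence="MKTAYIAKQR",
    )
    print(is_gold_standard(candidate), rejection_reasons(candidate))
```

Returning the list of failed criteria, not just a yes or no, gives a curator something concrete to follow up in the literature.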
UniProt will then make sure that there's a UniProt number for this. They will go through their annotation and prepare a template that in principle contains a reference to the sequence, a reference to the function, and a few other things about the strain designation and so on. Then these are sent to curators. This is something we're doing as part of the COMBREX project at the moment, but ultimately it will be taken over by NCBI. What we're asking is that the curators manually check that the strain information is accurate. Sometimes people say, well, you know, we did this on E. coli. Well, that becomes an unknown strain. That's fine, but one should not therefore assume that it's the same as something that is in the sequence databases. We want to make sure that the biochemical characterization is accurate, that the description that is in UniProt really matches what was done, and then that the gene was in fact sequenced, and hence the protein sequence is known, from exactly the strain in which the biochemical characterization was done. I just put a note here: the gold standard sequence for E. coli is from strain MG1655. There has not been very much characterization done on genes from that strain, so it's always important to make sure the sequence is the same as in MG1655 before the annotation is put in as gold standard. So how can you help? Well, there are many ways in which people can help. You can send in candidate gold standard genes and proteins. You can volunteer to serve as a manual curator; you just email Rich Roberts, roberts@neb.com. And you can browse the COMBREX web site and let us know if you see genes that should be gold standard. The COMBREX web site is combrex.bu.edu; again, if you google COMBREX you will find it. Now how are you going to prioritize genes, since I told you there are hundreds of thousands of these things?
Well, the first thing we want to know is how many organisms contain this gene for which there is a prediction. The larger the number of organisms the better, and especially if it transcends kingdoms, if it's in plants, if it's in animals, then that gets to be a lot more interesting. For any gene that is present in a large number of organisms, obviously, if you can work out what it does you get a lot of bang for the buck. We're also very interested in trying to finish the functional annotation of E. coli, so if the gene is currently unknown in E. coli and there is a good prediction, that would be a high priority for testing. The same goes for Helicobacter pylori. We wanted to show that this approach could be used to annotate pathogenic genomes; obviously NIAID is very interested in pathogens. Helicobacter pylori is very interesting. This is the organism that causes ulcers, but it's also one of the few bacteria absolutely known to cause cancer. It causes stomach cancer. It also causes esophageal cancer. It can be grown in the lab and it's actually relatively easy to work with. Is there a clone or purified protein available? The Structural Genomics Initiative has actually generated many thousands of clones of interesting unknown open reading frames that are present in GenBank. The idea there was that they would say, well, let's make this protein, let's try and crystallize it, and if we can get a structure then perhaps we can determine the function. Well, it turns out that when you try that approach, about one in ten of all the proteins you make actually crystallize, and then when you get the structure, most of the time you haven't got a clue about function. Structure does not easily give rise to function. But all of these clones and all of these proteins are available, and we can tell you how to get them. And then there's a whole bunch of other criteria I don't want to go into, so there's a whole list of things that make particular proteins interesting. Who can help?
Well, the computational biologists and the biochemists, obviously; geneticists can help too. University students, high school students: we're actually going to run an experiment with our local high school to have their students come in to Biolabs and do some of these tests as part of a high school science fair project. I was at the Intel Science Fair earlier this year, and I can tell you the kids who were participating there from all over the world are more than qualified to do these kinds of tests if they go to a decent lab and get some supervision. I think you could easily go and find high school students here who could come and work in labs at Georgia Tech; that would be straightforward. Emeritus professors: there are a lot of professors who retire and go emeritus and they love doing experiments. Many of them take the opportunity to go back to the lab. Not looking at anybody in particular, but I think there are some opportunities here. And finally, of course, we need help from the funding agencies, and I've been talking not just to NIH but to NSF, to Howard Hughes, to the Wellcome Trust in England. A lot of them have expressed interest, because this can actually be a wonderful teaching tool. There are many ways in which you could build this into undergraduate education, and many ways in which you can run summer courses in which you get students to participate in something like this, and a lot of the time you don't need a lot of funding to get it going. We're hoping to make this into a worldwide effort. It's an ideal project for developing countries, where they don't have a lot of money for science but often there is a lab that has a lot of expertise in an area and can do this and can use it to train students. So I think there are a lot of possibilities for people to get involved in this. And of course you can help.
These are the people that are involved: myself; Simon Kasif, who is a computational biologist at Boston University, a guy who came out of computer science and moved into bioinformatics. Martin Steffen is a biochemist, and he's looking after a lot of the biochemical side of things. Charles DeLisi is an emeritus professor at BU; he was the guy who first provided funds for the Human Genome Project when it got started. Daniel Segrè is a young assistant professor at BU who looks at networks; he's interested in networks within bacteria and other organisms. Steven Salzberg is a well-known computational biologist at the University of Maryland. Dennis Vitkup is another well-known bioinformatician, at Columbia. And our collaborators: we already have a lot of collaborators, both on the computational front and on the biochemical front. We have half a dozen projects already funded to get going on this, and of course we're very grateful for funding from the NIH. They gave us a GO grant from the stimulus money, so this money is going to run out, and we're busy working out how to get them to give us a lot more, because obviously we're just at the start of this. But I think it's a very exciting project. It's something in which we can show that we as biologists really can collaborate, because however you look at it, science is a collaborative activity. Everything we do builds upon stuff that's been done elsewhere. And I can tell you it is a lot more fun to collaborate than it is to compete. I have a long-standing collaboration with my colleague here at the front, a crystallographer. We've worked together for many, many years, and it's so much more fun to collaborate than it is to try to compete with people. I hope that all of you will at least think about this and see if there's some way in which you can help and collaborate. Thank you. Thank you very much for the inspiration to get the ball.
I'm going to buy over the place in sequence is the book to me which is increasing all your problems and you should see some problems. This is your question my letters for you with my eyes help you need to construct to convert double standards or anything else. Even your heart. I read well. So I think, first off, let's say you have a protein and you have a gold standard. There are many ways in which you can say how far away it is. Obviously there's just a simple statistical test, how many amino acids are identical, but you can also look at motifs: are key motifs conserved? In particular, if you know that within the gold standard there is a little area that forms a catalytic site or a substrate recognition site, and you can see that that is different in this new gene, then already you can begin to say, well, maybe this new protein doesn't have the same function; maybe it's different. Many of the programs that are annotating genomes at the moment don't even do something as simple as that. So I think the idea of providing the gold standard is that it will really give the annotators the opportunity to do a much better job of annotating things, because they have something to relate it all to, and that's not what is happening at the moment. You know there are more of one on one. So we don't want to call it works you call a cute little guy so expression is actually a lot easier than it used to be, but there are a lot of tricks, and I think one of the key things is you really have to keep your eye out for where people have come up with good tricks. Now, one of the things is that we're focusing on bacterial and archaeal genes, not doing any eukaryotic genes at the moment in this particular project, partly because of expression but partly because eukaryotic genes are so damn difficult and complicated; even predicting where they are can be a big problem. I mean, I'm sorry I discovered this split gene thing, because it just complicated biology enormously.
But I figure if we can really get everything going with bacteria and archaea, and get a good group of people together, we can at least show that it will work, and then we'll expand it into eukaryotes later on. So that is the idea at the moment. So call out in which we know that you know it's nice that in message. You know. So we're not looking for cellular function. All we're trying to do in this particular project is to get at the biochemical function, because if you want to go beyond that you're talking about really a huge amount more work, and not something that can be funded for a small amount of money. What we're hoping, though, is that what will come out of this is that people will find some new functions that can then lead them to write R01s to explore them and find out about the biology. What that OK. And what's right. So for protein-protein interactions, I think we would be satisfied at this point if we could identify the binding partner. So for instance, if it's a transcription factor, what is the DNA sequence that it binds to? If it's part of a multi-subunit complex, what is it a part of? Take the ribosomal proteins: they're part of a massive complex; some of them have enzymatic function but a lot of them don't. But just the fact that they're a part of that complex is good enough for our purposes at this stage. So we're not trying to solve the whole of biology. We're trying to do something pretty simple that will actually lead to better annotation of genomes, so that's really what it's all about. You know right. So I told you we have this class of enzymes that is represented by MmeI. This is a single-chain restriction-modification enzyme, and in this particular case we already know how to do that: we can make many changes in specificity in that particular class of enzymes. We have another class of enzymes that we're just starting to make some progress on, where we think we will be able to do the same
sort of thing. And that is our goal at Biolabs: not to just keep offering more and more enzymes, but to say, you want an enzyme with a particular recognition sequence specificity? Tell us and we will make it. So this is our goal, and most of the work at Biolabs now is aimed towards understanding enough about specificity so we can make new ones by design. Yeah. You don't. OK Just a scratch I'm sorry this is working for first class restriction enzymes structures from you know for sure. Structures and watch. It's a very mature for the woman so that fall into about four classes you can classify them as four among the ones we know we don't have a lot of structures. We have something like sixteen or eighteen structures at the moment, and they are certainly not fully representative of the range of restriction enzymes that we know about. In particular, the thing that I would like to see is some structures for examples of enzymes that recognise exactly the same sequence and cut it at exactly the same point, and yet show no sequence similarity. We do have one example something like that, although it's not quite that: EcoRI and BamHI recognise different sequences, but the folds are almost superimposable, and the only three amino acids that they have in common are the three that form the catalytic triad that does the cleavage. And that's not of any use in terms of predicting stuff. which you are clearly one among young people five years we've got to see because they're pretty much reacting in a fine. Gesture to a late sense. Right right. And we have tried some of that. In some cases you can make predictions about what these things are going to recognize. But in general that approach has not worked terribly well when the sequences are too different. So for instance EcoRI and BamHI, whose structures almost coincide: if you try to thread one on the other it doesn't work. The threading algorithms don't get it.
So it's a problem. Structure prediction in general is a bit of a problem. If things are close then threading works very well, but as you get further and further away, I mean, lots of these have sort of four percent similarity, perhaps, in identity, and then it's quite a problem. We'd be glad to send you some if you want to try. reading signals the functional size trying to use them. Why this mean. Terrible shock if you want to actually write but the problem is that with the catalytic triad in BamHI and EcoRI, you've got one here, one here, one here in the sequence, and in space, right. Space. Right. See that's the way it goes and so you get at least in theory which might be OK. Well, we'll send you some to try. And there was one more question, I think. Yes OK he's right. If I was running NIH I would do that, but I'm not. OK. Yes. Membrane proteins are a major problem from every conceivable perspective that you can imagine: from a structural perspective, from a functional perspective. I think at the moment we're going to leave the membrane proteins to the experts. So for instance, some of them we know are transporters; some are involved in transporting things to and fro. Milton Saier is a collaborator of ours, and so he is going to deal with that aspect of it. In some other cases we know that they're just integral membrane proteins. We don't know what they may be interacting with, what they're next to, what kinds of things they bind, and so we will just do the best we can, but they're likely to be the very last proteins that we can assign a decent function to. But even knowing that they're an integral part of the membrane, and particularly if you have some idea of which part of the membrane they are involved in, that can be helpful. And you know, biology is incredibly complicated.
I mean, you know, these people who think they're going to understand the nervous system: I say, go back and figure out how some simple bacterium works. Bacteria are lovely, beautiful organisms, and we should be able to understand how they work. By that I mean we should be able to write a computer program that will simulate one. But we have to know an awful lot more about them before we can do that, and I will guarantee that once we understand how bacteria work, we will have a much better idea of how humans and other eukaryotes work. But we really are complicated. It's quite amazing, and it's great, because it means we've got jobs for years to come. Right, thank you very much.