[00:00:05] >> I'm delighted to introduce our next speaker, John McDonald. My name is Srinivas Aluru. I serve as co-executive director for the Institute for Data Engineering and Science and also interim chair for the School of Computational Science and Engineering at Georgia Tech. John is a professor in the School of Biology here. We got him in 2005 to serve as the chair of the School of Biology, and he also served formerly as head of the Department of Genetics at the University of Georgia.
[00:00:44] In [year unclear], John created the Integrated Cancer Research Center, and he has served as the director of that center since then. He's also chief scientific officer for the Ovarian Cancer Institute. What I like about John's research is his single-minded focus on trying to understand the mechanics of cancer, as well as on diagnosing cancer and treating cancer.
[00:01:15] The other aspect of his research that I really like is that he's not afraid to go where cancer research takes him. In particular, he has been using engineering principles and computational principles in advancing cancer research. He's a fellow of the American Association for the Advancement of Science, and he has been doing a lot of good work on ovarian cancer. So please join me in welcoming him.
[00:01:49] >> Thank you. Right, so I think maybe all of you know that cancer is a huge business, a billion-dollar business, and because of that I often get asked the question: how can Georgia Tech compete in this area?
And the answer to that question, we think, is to exploit, to integrate, the strengths that we have in computer science, nanotechnology, analytical chemistry, areas like that. Those are strengths that are stronger at a place like this than they are at medical schools, and so if we can effectively figure out how to integrate those strengths into the problems of cancer, we should be able to be competitive with major cancer centers. So that's the strategy that we've taken, and a few years ago we formed this Integrated Cancer Research Center, which is at this point a collection of about 45 faculty with expertise in a broad spectrum of areas, all of which I think can be effectively focused on the various problems of cancer. So that's what we're trying to do here to get into cancer research at Georgia Tech.
[00:03:10] So what I'm going to talk about today, of course because of the theme of this meeting, is how we're trying to integrate computer science, in particular machine learning, into cancer diagnostics and therapeutics. I'll talk a little bit about the diagnostics, where we're incorporating analytical chemistry combined with machine learning to try to come up with better diagnostics, earlier diagnostics, for cancer. But I'm going to spend most of my time talking about personalized medicine, personalized cancer medicine, and by that I mean that when a person presents with cancer, we can genomically profile their tumor and from that try to predict an optimal drug therapy for that particular individual. That's mostly what I'll try to talk to you about today.
[00:04:04] Before I start, I want to, in the spirit of full disclosure, let you know that I'm a molecular biologist. I'm not smart enough to be a computer scientist, but I am smart enough to know the potential utility of computer science in applications to cancer, and so that's what I'm going to be talking about with you today.
[00:04:35] So.
Accurate predictions are something that's critically important both to cancer diagnostics and cancer therapeutics, and as you probably realize, there are 2 ways to make predictions. One way is if you understand the cause-and-effect relationships that underlie the phenomena you're trying to study, and from that you can predict what's going to happen in the future for any particular application.
[00:05:08] The other way to make predictions is just based on correlations. So, as you see up here, there are formulas that can tell us how much friction is required to generate enough heat to produce fire, so that would be a prediction based upon cause-and-effect relationships. Or you can be like the caveman and just rub sticks together fast enough and find out that it starts a fire, and then you can make predictions based upon those correlations. In cancer biology, ideally, and in fact scientists like myself have always been trained this way, we should try to make predictions based upon cause-and-effect relationships. And in cancer we do know some cause-and-effect relationships. This picture up here is showing probably the poster child of cause-and-effect prediction of drug use in cancer, and it's based upon our understanding of pathways that involve tyrosine kinases. In this particular case, this is the BCR-ABL translocation, and it produces this tyrosine kinase constitutively at high levels, which then signals the cell to go into rapid division all the time. Knowing that cause-and-effect relationship, we can propose that a drug that would interfere with that tyrosine kinase should be an effective therapeutic, and in this case it is, for people with CML.
[00:06:45] However, the problem is that for most cancers we do not fully understand the cause-and-effect relationships that are going on, and for that reason it's hard to make accurate predictions based upon that kind of cause and effect. However,
there are huge omics profiles and genomic databases being generated every day, and I give down here the latest estimates.
[00:07:23] Currently, DNA sequencing and RNA expression profiling data is doubling about every 6 to 7 months, and within the next decade it's predicted to be about 40 million gigabytes a year, and that's probably an underestimate. So we're gathering huge amounts of correlative data, which is really a playground for people like you who are doing machine learning. So the question is how to apply machine learning and utilize that data to improve cancer diagnostics and therapeutics. For the work that we're doing in our lab and with our collaborators, we're taking high-throughput data and we're using machine learning, in collaboration with computer scientists here at Georgia Tech, to make predictions, and the predictions will be either in terms of cancer diagnosis or optimal drug therapies.
[00:08:20] As you know, there are many approaches to machine learning, and the one that we started with was largely due to a historical accident: the collaborator that I began working with here at Georgia Tech was an expert on support vector machines, and so that's the approach that we took. Since that time we've been expanding to other approaches.
[00:08:44] But at least from the
outsider's point of view, I don't see a tremendous difference in the accuracy we're getting between the different approaches. So what I'm going to talk to you about today is the work we've done using machine learning, and why we got into that is because the computer scientist I began working with here about 10 years ago was Alex Gray. As I said, he was an expert in that area, so he was the guy that we started collaborating with, and that's why we ended up using this approach and continue to use it. Our 1st application of machine learning was in the context of cancer diagnostics. Cancer diagnostics means, for one thing, we'd like to tell people: do you have cancer or do you not? That would be nice especially in a disease like ovarian cancer, which is often called the silent killer, because at early stages the disease is asymptomatic, and so you don't know you have the cancer until it's progressed well beyond the early stages. By the time it's diagnosed clinically it's difficult to treat; it has usually already metastasized throughout the body, requiring chemotherapy, and cannot be cured with surgery. If one could diagnose a disease like that early, it could be effectively treated by surgery: you could have your ovaries removed and be effectively cured of the cancer. So the key there is to try to have early diagnostics.
[00:10:15] And so we wanted to take an approach that a lot of people use, so-called biomarkers of the disease. Many people focus on proteins as biomarkers. You've all heard of the PSA test for prostate cancer; that's a protein whose elevation is associated with early-onset prostate cancer.
[00:10:39] What we started looking at, rather than proteins, was metabolites. If you ground up a cell, there are basically 3 kinds of juice in that cell: there's nucleic acids, so that would be DNA and RNA;
protein; and everything else falls into the bag of metabolites. So metabolites are the products of enzyme reactions; things like lipids, triglycerides, or ions, all of that stuff would fall in the bag of metabolites. And so one can take fluid from a person, either blood, urine,
[00:11:20] any kind of bodily fluid, and you can analyze that in an instrument called a mass spec. We have some experts in analytical chemistry here at Georgia Tech, and one in particular that we work with is Facundo Fernández, whose picture is up here on the right.
[00:11:38] What this instrument does is basically vaporize the molecules; they go through a vacuum tube, and it separates molecules by size and charge. There's a sensor down at the end, and as the molecules come through, the smaller molecules will come through faster, so you'll see a peak as a function of time for the smaller molecules, and as the molecules get larger, they'll peak at a later time. So you end up with a feature set, as you can see up there, of these molecules that are going through. In a sense you can think of that as a fingerprint of the juice that you put into the instrument. Then you can set your instrument to look at specific sizes, and as I said, we focused on metabolites, which are smaller than proteins, so we put the window down for small molecules, so we were looking only at metabolites. The idea is then to run these profiles over a large number of individuals, some of whom have cancer and some of whom do not, and that data will constitute your learning set. So you'll be looking at profiles of normal healthy individuals and profiles of cancer patients, and you're looking for features that can discriminate between the 2. That's the approach that we were taking, and then you go through what I'm sure you're all familiar with, feature selection. We may start off with something like 5000 features, and we try to narrow that down by doing feature selection; what
we're looking for is what's the minimal number of features that gives us the highest accuracy. As you can see here in this feature selection plot, we optimized in this initial study, which was based upon about 100 patients: 50 normal healthy women and 50 women with early-stage ovarian cancer. We optimized at around 16 metabolites, and it gave us close to 100 percent accuracy in discriminating between the groups. That doesn't mean this test will be 100 percent when we extend it to the entire population, but in the cohort that we studied it was highly accurate.
[00:13:52] And so what we're doing now in this particular project is expanding the number of patient samples that we're running. We've got samples locally from the Ovarian Cancer Institute here, which is based at the current time at Northside Hospital, a major cancer center here in Atlanta. We're also getting samples from Fox Chase Cancer Center in Philadelphia and from biobanks in Canada, at the University of Calgary. So at the present time we have 800 ovarian cancer serum samples, and those samples are now going through the pipeline. They're being analyzed by mass spec by Dr. Fernández and David Gaul, his assistant, and then in the machine learning part I have a number of outstanding students who are working on this. And of course I'm not a computer scientist, so they have to take classes here
[00:14:53] in machine learning and computer science, and that's where Peng Qiu and Le Song come in. These are the instructors, and we also go back to them with questions that we have along the way with regard to the computer science of it. So what we're trying to do is build models, and with these 800 samples we're going beyond just the question of whether we can distinguish cancer from normal, because we know we can do that. We're interested now to see if we can build models that can discriminate cancers of high malignant potential from those of low malignant potential. This is going to be an important bit of information for clinicians, because if a woman presents with evidence of early-stage ovarian cancer, and suppose she's a young woman, do you remove her ovaries, which precludes her from having children? If it turns out that she has a cancer of low malignant potential, you don't necessarily have to be so extreme in your treatment. So the answers to all of these kinds of questions are of clinical significance and have an impact on cancer patients.
[00:16:04] OK, that's what's going on in that, and the jury's still out on how well we'll do. So I'm going to move to the 2nd application, which is the use of machine learning to try to predict optimal drug therapies for individual cancer patients. This project was started with a colleague of mine, Fred Vannberg, in the Department of Biological Sciences, and in the beginning with his graduate student Cai Huang. When we began this several years ago, we started with cell lines rather than real cancer patients, and we started with a set called the NCI-60 cell lines. This is actually a set of 59 distinct cancer cell lines representing 9 different types of cancer, as you can see listed there.
[00:17:00] All right, and when you're going to build models,
machine learning models, you can build them on individual drugs, and so we initially built predictive models for the 9 FDA-approved drugs you see up here. The reason we selected these is because these drugs have been used in one way or another in the treatment of ovarian cancer, which was our focus initially when we started. The standard-of-care therapy for ovarian cancer is a combination of carboplatin and paclitaxel, the 1st 2 that are shown up here. One is a DNA-damaging agent; the other one is involved with microtubules in cell division. But about 20 to 30 percent of cancer patients will not be responsive to the standard-of-care therapy, and so after that the physician really goes into a trial-and-
[00:17:56] error mode of treatment, trying different types of chemotherapeutic drugs to see what works, and the ones listed below here are some of the drugs that are selected for these 2nd-line therapies. So we wanted to build models on all of these things. In this case the data that goes in is gene expression profiles: the patient comes in, we get a sample of the tumor, we extract RNA from that tumor, and then we look at gene expression profiles. Those can be looked at either through a technology called microarray, which I'll talk about 1st, or you can do sequencing of the RNA molecules and quantitate by simply counting the number of counts of each RNA.
[00:18:48] So that was our input data, and then we know the response of these cancer cell lines to a variety of drugs, because the NCI-60 cells have been tested against about 1000 different kinds of drugs by NIH, and that data is all publicly available, as is the gene expression data for all of these
[00:19:14] cell lines. And so by combining those 2 we try to build a model where we can connect the genomic profile with the optimal drug therapy for that individual profile. For support vector machine algorithms, it assigns a score to each sample and generates a vector, or a line, that distinguishes the samples predicted to be sensitive from those predicted to be resistant. So for example, in this slide here you see the SVM is the demarcation line. If the score is above that, a plus value, we predict that this cell line will be sensitive to the drug, responsive to the drug; if the score is below that, it's predicted to be non-responsive to the drug. Then we can look at the actual data on the drug sensitivity of these cell lines, and if you look at the sensitivity of these 59 cell lines to the drugs, it fits pretty much a normal distribution. That is, there are some cell lines that are not responsive at all, some that are, and a lot of them sort of in the middle. When we were building the model, we
[00:20:28] chose as our learning set cells that were greater than half a standard deviation away from the mean. In other words, we wanted to build our models on the more extreme ends of the spectrum, cells that are definitely sensitive to the drug versus cells that are definitely not sensitive, and we built a model there. But then when we test a model, we test it across the full spectrum of the cells. IC50 is usually the metric that's used to
[00:21:01] impart the information on sensitivity to the drug, and so with the IC50 scores, if we plot that out on the X axis here, and again, if you've got an IC50 score that's lower, meaning a smaller amount of the drug is needed to kill the cells, it means you're more sensitive. If the IC50 score is high, it means you need a lot of the drug to kill the cells, and that would be indicative of the cells being less sensitive to the drug. So we can plot that out on an axis, I mean, that was the Y axis, and we can plot on the X axis what our predicted scores were from the SVM models; again, we predict them to be sensitive or not. Then by making this matrix we can cluster the results: either you're
[00:21:57] predicted to be resistant and you're observed to be resistant, so that's a correct prediction; or you're predicted to be sensitive and you're observed to be sensitive, that's the other correct prediction; or you can also identify the false positives, that is, cells I predict to be resistant that are observed to be sensitive, and that's an incorrect prediction. So by plotting the results out in this kind of matrix we can compute things like sensitivity, which is, you know, true positives divided by true positives plus false negatives. Sensitivity is how good your model is at predicting response if it is a true response, and specificity is how many mistakes you make, basically; and then accuracy you can compute that way too. So if I show you some of the results of what we got on the cell line studies, this is for a particular drug, carboplatin, and based on the models that we did for this drug you can see it was pretty good. The overall accuracy of our test was 84 percent. So 84 percent of the time, when we predicted a cell line would be sensitive to the drug, it was, and when we predicted it was not sensitive, it wasn't.
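The sensitivity, specificity, and accuracy figures mentioned above can be illustrated with a short Python sketch. This is only an illustration under stated assumptions, not the speaker's code: the function names `confusion_counts` and `summary_metrics` and the six-sample data are hypothetical, "positive" is taken to mean "sensitive to the drug", and the standard textbook definitions are used (specificity = true negatives / (true negatives + false positives)).

```python
def confusion_counts(predicted, observed):
    """Tally a 2x2 confusion matrix; True means 'sensitive to the drug'."""
    tp = fp = tn = fn = 0
    for pred, obs in zip(predicted, observed):
        if pred and obs:
            tp += 1      # predicted sensitive, observed sensitive
        elif pred and not obs:
            fp += 1      # predicted sensitive, observed resistant
        elif not pred and not obs:
            tn += 1      # predicted resistant, observed resistant
        else:
            fn += 1      # predicted resistant, observed sensitive
    return tp, fp, tn, fn


def summary_metrics(tp, fp, tn, fn):
    """Standard definitions; assumes each observed class has at least one sample."""
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    return sensitivity, specificity, accuracy


# Hypothetical predictions and observations for six cell lines (not real data).
predicted = [True, True, False, False, True, False]
observed = [True, False, False, True, True, False]

counts = confusion_counts(predicted, observed)
sensitivity, specificity, accuracy = summary_metrics(*counts)
print(counts, sensitivity, specificity, accuracy)
```

With these made-up labels, 4 of the 6 calls agree with the observations. Note that under this convention a cell line predicted resistant but observed sensitive counts as a false negative.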
[00:23:13] For other drugs like paclitaxel, again we've got fairly high accuracy, 85.7 percent, and so forth. We went through this for all of these drugs, and as a general rule we were getting pretty good accuracy: the lowest maybe was around 75 percent and the highest somewhere around 88 percent. Now you can take that kind of data, the predictions versus the
[00:23:40] actual observed data, and do these kinds of density plots. What you're seeing here is what I would call a pretty good model, an excellent model, because red and green are showing what the observed results were. If it's red, it means that we know these cells have been demonstrated to be resistant; green means they've been demonstrated to be sensitive; and down below you see the predictions. Our model is completely, 100 percent accurate. But sometimes the data looks like this, and that means your predictive model is not quite as good as you might want it to be. You get this kind of result and you're dealing with a lousy model; it's really not predicting with very good accuracy. So we wanted to start to address some basic questions about the use of machine learning. As you all know, the danger of any kind of computer model is garbage in, garbage out, so from the point of view of cancer biologists, we want to know what is good garbage and what is bad garbage: what kinds of things can affect our predictions? The 1st thing we asked ourselves is, should machine learning models be built on a tissue-of-origin basis or not? Traditionally, people have thought that if I have ovarian cancer
[00:25:05] and my friend has breast cancer, those are 2 completely different diseases, and that's the belief that cancer should be defined or characterized based upon its tissue of origin. But as we study more at the molecular level, we realize that that's probably not true, because if we look at what genes are abnormally expressed in different tissues, we can see a lot of overlap between cancers. In fact, there are some breast cancer
[00:25:34] tumors that on a molecular level look more like ovarian cancer tumors than ovarian cancer tumors look like other types of ovarian cancer. So I think that there is a paradigm shift going on at the present time in cancer biology, stating that cancers probably should be defined and characterized based upon their molecular profiles as opposed to their tissue of origin. But anyhow, we wanted to test this and see if we could get any evidence that that's true. So we went back and took our NCI-60 cell lines, and we selected 17 cell lines to study, to build models and to test on, but the 17 cell lines were selected from only 2 cancer types.
[00:26:27] That was then compared to again selecting 17 cell lines to build the models, but now selecting those 17 cell lines across all 9 cancer types. So the difference here is the variety of cancer cell types that we used to build the model. When we did that and looked at our predictions, you can see that in the top figure we're getting a lot more overlap, that is, things that were predicted to be nonresponsive turned out to be responsive, and things we predicted to be responsive turned out to be nonresponsive. Our models were improved in those cases where the model was built across all 9 cancer types as opposed to just 2 cancer types. We've gone on to look at this kind of thing in more detail, and it suggests, again, that it's probably better to build your models across as broad a spectrum of cancers as you can, because you're more likely to encompass all the various genomic characteristics of the tumors, and then let the machine learning narrow it down as to what the most informative features are. If you go in and try to preselect what data you're going to feed in, you could be wrong.
[00:27:56] So intuitively it seems like, if I want to predict ovarian cancer, I should build my models on ovarian cancer. But because ovarian cancer in some cases, for some patients, is more similar to breast cancer, that presumption is probably incorrect. So we think the models should be built on as broad a spectrum as we can. All right, the next thing we wanted to look at is
[00:28:22] the probes, so this is looking at microarray data. If you don't know what a microarray is, it's like a glass plate that's tiled with pieces of DNA representing every gene in the human body. We extract RNA from a tumor, it's fluorescently labeled, we pour it over this glass plate, and if the RNA is made from a gene, it sticks to that DNA.
And so you end up looking at a thing with bright spots and black spots. If it's black, it means that gene is not turned on; if it's a bright spot, it means the gene is producing RNA in that sample, and the intensity of the signal is a function of the quantity of the RNA. Now, on these kinds of chips there are usually about 9 probes on each chip representing one gene. So if the gene's this long, we put a piece of DNA from here, a piece of DNA from here, a piece from here, about 9 of them on the chip.
[00:29:23] Then the hybridization is done, and what is typically done is to take the RNA signal off each of those 9 and average it to get an average value for each gene. That's the way most people do it. We were interested to see: is that a good way to do it, or should we have as our input each probe of each gene? Maybe you don't know this, but genes can do things like splice alternatively. So when you produce RNA from a gene, you can get splice variants. These occur after the RNA is produced, and those splice variants go on to produce different proteins. So there's about,
[00:30:07] well, we estimate there are around 27000 human genes, where a gene is defined as a segment of DNA that encodes a protein. But if we look at the number of proteins, it's about 200000. There are a lot more proteins than there are genes. Why? Because there are a lot of splice variants for every hunk of DNA
that's encoding a protein, so you get variant forms of proteins off of this. So from a machine learning point of view, what should we be loading into our algorithms: the average values for genes, which is what has typically been done, or should we feed in all of the probe sets? If we feed in all the probe sets, we're going to get more information, because it'll pick up different splice variants; if we put in the average, we're going to be putting in less information. So we wanted to test whether that had any relevance, and we took a data set, a breast cancer data set, that was given out in the DREAM Challenges, which they have I think every other year or something, to see whose model is best at predicting the correct drug for this particular dataset. At any rate, we took this data set and we took the average values of all of the gene expression, which is what was presented to the competitors in the challenge, but we also went back to the original data and were able to extract the information from each probe. So we built 2 predictive models, one with the average values and one with values for every probe, and you can see the difference here. If we used the average, the accuracy of our predictions was 78 percent; if we used all of the probes,
[00:32:00] our accuracy went up to almost 90 percent. So it made a huge difference as to whether we were feeding in
average values or the full spectrum. So again, I think the message we're beginning to get here is that one should build models with as broad a spectrum of data as you can and let the machine learning determine what is the most significant. Don't try to anticipate what should be important and what isn't. And then finally, the same kind of message came out with a group of genes contained in a database called COSMIC, which are genes that people believe are drivers of cancer. Of all 23000 genes that we have, it's estimated that around 500 or so are actually genes that can be actively involved in producing cancer, so they call these cancer driver genes. So again we went back and said: if we build our predictive models based on the gene expression values of all genes versus the gene expression values of cancer driver genes, what do we get for relative accuracy? We selected 315 cancer-related genes; this is a set of genes that's used by one company, called FoundationOne, that makes predictions about optimal drug therapy, and we compared that to models that we built with the input of all genes. Here you see that the accuracy if we used only the cancer driver genes is 59 percent; if we used all genes, our accuracy was 89 percent. So again, it's telling us
[00:33:47] since we don't understand all of the cause-and-effect relationships in cancer, we should not filter data prior to applying machine learning, because we don't know what we would be cutting out. We just don't know enough. It's better to let the machine learning sort out the most effective things, and if you try to filter things, in fact you're going to lower your accuracy. All right, how well does this stuff work in cancer patients as opposed to cell lines? We started off by getting data for 419 ovarian cancer patients, gene expression data of their tumors, and we took that data and put it into the algorithm that we had developed with the cell lines. We predicted for each one of those patients: do we predict you're going to be responsive to a drug or not responsive to it? Then, looking over those individuals, there's going to be a distribution. For example, if we look at gemcitabine, our predictive models say about half of the patients will be responsive and half of the patients will be non-responsive, and we did that for each drug. So for example, for this drug, we would say this is a lousy drug to use for ovarian cancer, because our model predicts that almost all of the patients will be non-responsive; only a few of the patients were predicted to be responsive.
[00:35:21] For cisplatin it was about 50/50. For this drug, again, a very poor drug, we would predict. What would we say would be the best drug? It looks like carboplatin, down at the bottom, or Taxol, the next one up, and lo and behold, those are the 2 drugs that, over the years of therapy, clinicians have decided are the best ones to use for
[00:35:47] standard-of-care therapy. So it turns out our models, which didn't know anything about what clinicians were using, ended up predicting the same thing. In fact, if we go to the literature and correlate our predicted score with the actual observed
benefit of a given drug in an ovarian cancer population, over thousands of ovarian cancer patients, the correlation is quite good. So our model is predicting with fairly good accuracy how patients will respond to these different drugs.
[00:36:23] All right, how well does the model work in predicting individual patients? For this we went to another database, called the TCGA database. It's comprised of about 2.5 petabytes of data, including genomic profiles of tumor and matched normal tissue from more than 11000 patients representing 33 types of cancer. The problem here is that, despite the impressive size of this data set, we're limited, because mostly all it's giving us is gene expression profiles; we do not have matched information on how well those patients responded to a drug. So since the availability of the correlated sets of data that we really need to build the model was limited, we had to combine data from patients associated with a diversity of cancers
[00:37:23] but who were being treated by the same cancer therapy. The 2 cancer therapies we're talking about are gemcitabine and 5-fluorouracil, so we're looking for patients that were treated with those drugs, and we don't care what cancer they had. Why did we do that? Because there are only a few
[00:37:46] patients that have information on both expression and response to the drug, and so that's why we had to do it this way. By the way, this work was led by my brilliant graduate student, Evan Clayton. I'm not sure he's here today, but if he is, that's what I want to say: he's brilliant.
[00:38:07] Here's the data set that we used. For the drug gemcitabine, you see above, it shows the number of responsive individuals, the number of non-responsive individuals, and the total number of samples. Then again we go through feature selection, just like we did for the cell lines, and what you'll see is that for gemcitabine it optimized at 81 features, so 81
[00:38:34] genomic features, meaning expression levels of 81 genes, and for fluorouracil the optimal size was 31. So it's interesting to ask the question: what's the deal with these 31 genes? Are they the genes that you would expect to be involved in drug responsiveness or not? It turns out, amazingly enough, that for most of the genes that we find within that optimal
[00:39:03] set of predictors, the function is completely unknown. Only a small fraction of them have a known function; most of them are not annotated, so we don't know what they're doing. There is some indication that a lot of the genes that are optimal predictors are involved in a process called apoptosis, which is a process of programmed cell death. When cells experience DNA damage, they have a mechanism where they 1st try to repair the damage, but if they can't repair it, then they self-annihilate. It's like an altruistic response of the cell to the community of cells that it's living in. Well, a number of these genes are involved in apoptotic function, which makes sense, because these chemotherapies are all DNA-damaging agents and their intent is to kill the cell. How they do it is not by the damage they impart themselves; because they impart the damage, they induce apoptosis in the cell, and that's what's actually killing the cells.
[00:40:06] All right, then after we build the models, we can test the models
By seeing whether our prediction is consistent with what is observed, and these are the kind of results we get. Green means we're predicting non-responsive and it was non-responsive, so that's a true negative, and the red represents mistakes that our prediction made with respect to sensitivity to the drug. And then if we plot those results [00:40:41] across a spectrum like this, as I showed you before, the demarcation with the support vector is: plus means responsive to the drug, negative means we're predicting non-responsive. That's plotted out for each patient, and then blue means that patient was observed to be a responder to the drug and red means they were observed to be a non-responder. Ideally we would like to see all the reds on this side and all the blues on the other, and you see we did make mistakes on this, but when we look at the overall accuracy it was about 81.5 percent, [00:41:25] and pretty much the same for both of the drugs. So again, the models are pretty good. They're not what we would consider optimal, and maybe people like you can come up with better computational techniques to improve upon this kind of accuracy, but at any rate, at this point we're getting around 80 percent, which is still a greater benefit, I think, than not knowing anything at all a priori about whether the drug is going to work or not. And then we went ahead and tried this out on our own data, generated from 23 ovarian cancer patients.
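The scoring described above (a positive support-vector score is read as a predicted responder, a negative one as a predicted non-responder, and accuracy is the share of patients where prediction and observation agree) can be sketched in a few lines of Python. This is an illustrative sketch only, not the speaker's code: the function names and the five example patients are hypothetical.

```python
from typing import Dict, Sequence


def confusion_counts(scores: Sequence[float], observed: Sequence[bool]) -> Dict[str, int]:
    """Tally predicted versus observed response.

    Assumption: a score > 0 means "predicted responder" and any other score
    means "predicted non-responder", mirroring the plus/negative split
    described in the talk.
    """
    if len(scores) != len(observed):
        raise ValueError("scores and observed must have the same length")
    counts = {"tp": 0, "tn": 0, "fp": 0, "fn": 0}
    for score, responded in zip(scores, observed):
        predicted = score > 0
        if predicted and responded:
            counts["tp"] += 1  # predicted responder, observed responder
        elif not predicted and not responded:
            counts["tn"] += 1  # predicted non-responder, observed non-responder
        elif predicted and not responded:
            counts["fp"] += 1  # predicted responder, observed non-responder
        else:
            counts["fn"] += 1  # predicted non-responder, observed responder
    return counts


def accuracy(counts: Dict[str, int]) -> float:
    """Fraction of patients whose prediction matched the observation."""
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no patients to score")
    return (counts["tp"] + counts["tn"]) / total


# Hypothetical scores and outcomes for five patients (not data from the talk).
example_scores = [1.2, -0.4, 0.3, -2.0, 0.8]
example_observed = [True, False, False, False, True]
example_counts = confusion_counts(example_scores, example_observed)
print(example_counts, accuracy(example_counts))
```

Keeping the four counts separate, rather than reporting accuracy alone, makes it possible to see whether the mistakes fall mostly on predicted responders or on predicted non-responders.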
[00:42:02] These were all early-stage patients, or no, I guess not all; some of them are later-stage patients. How we do this is what you see here, and this is the kind of graph that we would probably envision ultimately being given to clinicians. What you're looking at in the blue distribution is the predicted distribution of all ovarian cancer patients that we've looked at so far, and there are hundreds, and their predicted response. You see it's more or less a normal distribution. Then an individual patient comes in, we do a genomic profile of that individual patient, we run it through the algorithm, and then we plot where that individual patient lies on the spectrum of all patients, and that's shown here by the red line. So you can see, for patient 286, we're predicting her to be a non-responder to carboplatin and a non-responder to Taxol, which are the standard-of-care drugs that are given. And down below we look at the value of CA 125, which is what clinicians use to say whether a patient is responding to a therapy or not.
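Placing one patient's score on the distribution of scores for all previously profiled patients, as in the graph described above, amounts to asking what fraction of that population scores at or below her. A minimal Python sketch follows, assuming hypothetical scores; the function name and numbers are illustrative and not taken from the talk.

```python
from bisect import bisect_right
from typing import Sequence


def fraction_at_or_below(population_scores: Sequence[float], patient_score: float) -> float:
    """Return the share of the population scoring at or below the patient.

    A value near 0 places the patient at the low (predicted non-responder)
    end of the spectrum, a value near 1 at the high end. Treating "higher
    score = more likely responder" is an assumption of this sketch.
    """
    ordered = sorted(population_scores)
    if not ordered:
        raise ValueError("population_scores must not be empty")
    # bisect_right counts how many sorted scores are <= patient_score.
    return bisect_right(ordered, patient_score) / len(ordered)


# Hypothetical population of predicted-response scores (not data from the talk).
population = [-2.0, -1.0, 0.0, 1.0, 2.0]
print(fraction_at_or_below(population, 0.5))
print(fraction_at_or_below(population, -1.5))
```

An empirical fraction is used here instead of fitting a normal curve, so the sketch does not depend on the population actually being normally distributed.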
[00:43:23] The dotted line on the bottom down there is the day of surgery. So the CA 125 level was taken 200 days out from the day of surgery and it was quite high, so this patient obviously had ovarian cancer, not only based upon radiological evidence but based on CA 125. Surgery itself results in a significant reduction in CA 125, which is what you're seeing in the drop down to around 200 days after the surgery. But then, after the surgery effect, even though the patient is put on standard-of-care therapy, her CA 125 level started to creep up again, so that would be our definition of a non-responder patient. In contrast, on the right, on the other side of the graph, you see a patient who again dropped down after surgery, but her CA 125 level remained low in response to that combination therapy, so that's our definition of what a responder would be. So again, for that second patient, patient 336, our models were predicting that she would respond, and the observation was that she did respond. We go through the whole thing for all of these and we can calculate how many times we were correct in our predictions and how many times we were not correct, and again it came up in the 80 percent range; in this case it's about 84 percent accuracy over these 23 patients. So we're encouraged by the results so far. What has to be done is to test it across more patients and more different types of cancer to see if it holds up. Now, one thing that is kind of interesting, and of clinical significance, is that we find that for drugs that are highly effective in the general population, for example carboplatin, to which about 70 to 80 percent of patients will respond,
[00:45:20] Some patients are predicted not to; it's predicted that that therapy will not be effective for them. So you can see here in this list, even though the average response was 70 percent, patient 545 was predicted by our model not to be a responder, and indeed she wasn't a responder. It would be of clinical use to know these kinds of things before a patient is put on chemotherapy. The standard-of-care therapy is these drugs, and it's 6 rounds of therapy, so it's several months, and these things are toxic, so they're not pleasant for patients to go through. So it would be nice if we could predict ahead of time which drugs are likely to work and which aren't, without going through the trial and error of having the patient go through the chemotherapy. Also, we find that for drugs that are only moderately or poorly effective in the general population, like gemcitabine, to which only 57 percent of patients are predicted to respond, for those patients that are predicted to respond, that's an important thing to know from an individual point of view. [00:46:30] So we think that at the current time no clinician is going to not give the standard-of-care therapy, because they're afraid of lawsuits and because this is the standard of care used in the community, so patients are going to get that. Where we think the application of our model would be in the short term is for those patients that fail standard-of-care therapy. Then it comes down to: what should I do next for this patient? And the way it is now, it's basically a crapshoot. They try all kinds of drugs: let's try this one, wait for a couple of months, no, let's try this one. So we think our models could be very useful in predicting second-line therapy. Since it's completely up to the clinician anyhow to decide, this would be a piece of information that may help them in their decision about what to use.
[00:47:26] All right, then I just want to point out that even though all of the models we've built so far involve the DNA-damaging type of chemotherapy, which is the traditional type, with the new kinds of therapy that are coming out, so-called immunotherapies, which use the immune system to attack cancer, there's also variability between individuals in their response. So again, there's going to be a genetic basis for that, and the same approach that we're taking here could be used to predict the responsiveness to immunotherapies, provided people like you and me get access to the large data sets that we would need to do it. All right, so then, in summary: the ultimate goal of cancer research is to understand the underlying causes of cancer onset and progression and to use this knowledge to generate accurate diagnostic tests, but with few notable exceptions our cause-and-effect understanding of the processes underlying most cancers is limited. Machine learning is a proven method of generating accurate predictive models of correlative trends detectable in large data sets, and so the application of machine learning may constitute an effective interim method, if not a [00:48:41] final method, of generating diagnostically and therapeutically relevant predictions. And so I just want to give my thanks. We of course have to thank the people that provide the tissue to us in the first place; that's the Ovarian Cancer Institute, and the CEO
is Dr. Benedict Benigno. On the next level down, analyzing the data on the metabolic profiles, I have to thank Facundo Fernández and David Gaul, who are the analytical chemistry experts that contributed to this study. On the computational side, the application of machine learning for drug prediction in the cell line studies was a collaboration with Fred Vannberg and Cai Huang. And then the work that's going on now, that we're doing for the 800 samples plus additional patient samples and patient data for genomic profiles, is being carried out by the people at the bottom, 5 of my graduate students, with the guidance of the 2 individuals you see here, who are faculty members at Georgia Tech in computer science. And then the bottom just thanks the people providing money for the research [00:49:58] that we do. So that's it. >> Thank you, this is a very interesting talk. In the very first part of your talk you mentioned that you characterize the response of the cell line based on the IC50, which is a continuous variable, so why don't you directly build a model to predict IC50? >> Right, so in the models we were building, and this is because of our naivete, I'm not an expert in computer science, we had to have binary input into our models. So we had to have: are you a responder or are you not a responder? And so we had to have some way to define, from a normal distribution, who we should say are responders versus not, and we just arbitrarily chose anybody who is half a standard deviation below the mean as a non-responder and half a standard deviation above as a responder, and that's what we built the models on. I think that these models could be built on the quantitative distribution, and I think that's something that should be worked on; it's just that we didn't do that in this particular case. The fact that it's working on binary means, I think, that what you're suggesting would result in even greater accuracy.
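The binarization described in this answer (half a standard deviation on either side of the mean) can be sketched as follows. This is a hedged illustration, not the speaker's code: the function name is hypothetical, the scores are assumed to be oriented so that higher means more responsive, and leaving the middle band unlabeled is an assumption, since the talk does not say how those samples were handled. For raw IC50 values, where a lower value means a more sensitive cell line, the orientation would have to be flipped first.

```python
from statistics import mean, stdev
from typing import List, Optional, Sequence


def label_by_half_sd(response_scores: Sequence[float], band: float = 0.5) -> List[Optional[bool]]:
    """Turn continuous response scores into responder / non-responder labels.

    True  = at least `band` standard deviations above the mean (responder)
    False = at least `band` standard deviations below the mean (non-responder)
    None  = inside the band; left unlabeled here (an assumption of this sketch)
    """
    center = mean(response_scores)
    spread = stdev(response_scores)  # sample SD; needs at least two scores
    low, high = center - band * spread, center + band * spread
    labels: List[Optional[bool]] = []
    for score in response_scores:
        if score >= high:
            labels.append(True)
        elif score <= low:
            labels.append(False)
        else:
            labels.append(None)
    return labels


# Hypothetical response scores (not data from the talk).
example_labels = label_by_half_sd([0.0, 1.0, 2.0, 3.0, 4.0])
print(example_labels)
```

The questioner's alternative, predicting the continuous value directly, would skip this step altogether and fit a regression model to the scores instead.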
[00:51:28] >> Another question: the first part of the talk is essentially cell line experiments, and the second part is patient experiments. Do you see any correlation between those 2 parts? >> Say it again? >> So, for example, the cell line response versus the patient response, do you see any... >> Yeah, so the first set of patient responses I showed you, where we did the distribution, was based on models that were built with the cell lines. So the cell line models were predicting the patients accurately, which was surprising; I certainly didn't expect that, but it did work. In the second part of my talk we were building the models on the patient data itself, but the accuracy we're getting either way didn't seem to be that tremendously different. The first part was just an ovarian cancer study, so the models were built only for ovarian cancer and tested only against ovarian cancer. I don't know if that's going to hold up for all cancers, but I was surprised that the models built on the cell lines were quite good at predicting patient responses as well. >> Thank you. [00:52:48] >> Thanks, John, for a very good survey of all the work you have been doing. >> Thank you. >> You mentioned that you have 800 patients for ovarian cancer for which there is data available, clinical as well as genomic data. Similarly, TCGA has about 11,000 patients, right, something like that. So those datasets are still very limited. [00:53:14] >> Well, you can get access to gene expression data on TCGA, no problem, but I'll tell you, response to the drugs is not in that data. >> What I was getting at is that there are very good efforts at the hospital level
to collect clinical data on the patients, right, but there's not a very prevalent effort to collect genomic data. >> Yeah, not the kind of effort I'd like to see. I think there will be some movement; it's happening now and people are starting to collect that data. I think drug companies have had that kind of data for a number of years, it's just that they don't release it right now. But we need to get access, and it has to be made open. I mean, in our first paper, when we published our first algorithm, we made it open access, because most people do not. If you think you've got a good predictive algorithm, you right away start envisioning the yacht you're going to buy and travel through the Caribbean on, so people do not openly share it. But the only way we're going to really make progress is if we open up not only the algorithms and the scripts that are used to make them but also the databases, and I think NIH is definitely now pushing towards that. So I'd say in the next few years there's going to be a lot more data available to us. [00:54:45] >> Thank you for your talk, it was really informative. I have 2 quick questions. My first question would be: I know you looked at the biomarker CA 125. I was wondering, did you look at any other biomarkers? Because I know one thing about ovarian cancer, or any type of [00:55:03] gynecological issue, is that CA 125 can be high, so it's not as...
>> Right, so we use CA 125, and we also have radiological evidence and things like that. For the patients we looked at, and we only had 23 patients, the CA 125 was correlated very well with the radiological data and so forth. You're absolutely correct in saying that CA 125 is not good as a predictor of disease, but it is pretty good for seeing how patients have responded. So they use it clinically, you know, after they treat a patient, to see if there's a recurrence occurring. It's pretty good for that, but it's not good as a predictor in the mainstream. >> And my second question is just on the distribution of the datasets in terms of race, just because I know that is a limitation, unfortunately, with data, and I know there's a push towards looking into more specifics when evaluating race. >> Right. Unfortunately, most of the samples that you get from these tissue banks are from Caucasians, so there's relatively [00:56:21] limited ability to test racial effects on this, or ethnic effects, or any other kind of effect. We're hoping, at least for the metabolomics work, because now we have 800 patient samples there and there is reasonably good representation of different racial groups, that it will be one of the things that we look at, to see if there is any clustering there. The data so far suggest not, at least for the diagnostic side. Certainly we know that there are effects: prostate cancer is much more prevalent in African-American men than in white men, so we know that there are likely to be effects. For the diagnostics, at least so far, we don't see it, but if it does show up, what it means is that we will just build a model for those subgroups, and so a patient would come in, and if you're in that subgroup, that's the machine learning model that would be used for you. So I think it's something we can handle; it's just that we don't know yet. >> Thank you.
[00:57:24] >> Do you understand the causal pathways from the variables that are selected to the outcomes that you're trying to predict? And do you think that would be important before you implemented this in any kind of clinical decision-making setting? >> So I think the answer for most cancers is no, we don't know, but that's the whole point, and it's what I tried to show: if we try to impose, or pre-filter the data based upon, what we think we know, we end up getting a lower accuracy than if we had just thrown everything in there. So I think that's just a reflection of our ignorance at this point. Now, with the machine learning, once we come to the optimal set of predictors, then we can look at what they are, and as I said, a lot of them now are things where we don't know what they're doing, but that could prompt molecular biology studies on those genes to try to figure it out. So machine learning not only can be an interim solution in terms of prediction, but it may be a guide for future research to understand cause-and-effect relationships, which is ultimately what we want to get to. [00:58:39] >> This might be an unfair question for you, so this is a warning. How close do you think we are to having a deployable tool that could be used by people who aren't experts in machine learning, not necessarily to make a diagnosis, because I know there are regulatory problems with that, but something that [00:59:02] a clinician could use in conjunction with expert... >> So when I first started doing this, I was looking at patients that did not respond to first-line therapy, because it's going to be very difficult at this point in time to overthrow the practice of doing [00:59:23] standard-of-care therapies with
machine-learning-predicted things, because people just don't want to trust it that much yet. But if you're talking about patients that have failed first-line therapy, what are they treated with after that? Right now it's purely at the discretion of the physician what to pull out, and it's just trial and error. That's what I think the sweet spot is for the application at this point in time. However, there are regulatory issues involved, legal issues, so what we're thinking of doing is just simply presenting the results to a clinician and not saying we think you should use this drug. We just show them where their patient lies on the spectrum for all these drugs, and the red line for one of the drugs will be way up here. We're not telling them to use that drug; we're simply providing this as a piece of information, just like a CAT scan would be another piece of information, and then the clinician can take that information and use it in their final decision. Where the legal issue would come in is if we went ahead and tried to say to them, we think you should use this drug, and we can't do that without FDA approval. But we can present data and just say, this is what your patient looks like on our spectrum, take it for what it's worth. I think we can do that in the short term, and as we get more data in, the accuracy should get better, not worse. In the future, absolutely, this is the way it's going to go; I mean, there's no way around that. But it's going to take a while to get FDA approval and to get over all of the hurdles that are there. Clinicians need to be educated more in the power of this kind of [01:01:24] technology; maybe it's just not taught in medical schools as much as it should be. But I think eventually, maybe in my lifetime, but I doubt it, this will become the standard-of-care approach. >> In the interest of time, I'm going to go ahead and conclude.