[00:00:05] >> [Host introduction, largely inaudible.]

[00:00:36] >> Well, thank you. Thank you. So today I'm going to talk about building healthier tensor models, and looking at how you can think about and motivate them from learning computational phenotypes from data. I wanted to briefly recap what you probably already know quite well, which is that there's been a lot of work towards computational phenotyping: looking at how I can better learn patient subgroups, or how I can better identify which patients might react well to a certain medication and not well to another.

[00:01:19] So you can take all this electronic health record data that's been popularized in the last 10 to 15 years, and the idea is we're going to throw a bunch of machine learning algorithms at it — whether it's tensor factorization, deep learning, you name it — and at the end what we get are phenotypes that let us differentiate between any two patients, or let us figure out who a given patient is similar to and what might work for different patients.

[00:01:48] Now, in terms of machine learning algorithms, there are two different paradigms: supervised learning and unsupervised learning. The downside of supervised learning is that you actually need to annotate all this data. In other words, in order to run a supervised machine learning algorithm, somebody has to go and review every single case study for all the patients you have in your cohort.

[00:02:18] On the other end of the spectrum you have unsupervised algorithms, which is what tensor factorization falls under. There are some issues to think about there too, but the idea is that I don't need an expert to tell me whether a patient belongs to case or control; I'm going to automatically discover interesting disease subtypes.

[00:02:39] The idea behind using tensor factorization is that you can represent higher-order interactions. So I can represent whether or not a patient takes a particular medication to treat diabetes versus whether they take something else to treat hypertension. And then if I decompose it — in particular, the CANDECOMP/PARAFAC (CP) factorization has this very nice interpretation — I get R rank-one tensors, and each rank-one tensor is the outer product of different vectors. So in this case, if you look at a three-dimensional, third-order tensor, along one side I might have the procedures that are prescribed for a patient, and along another dimension I might have the diagnoses they get. If I use tensor factorization to decompose it, I get this very nice interpretation.

[00:03:38] I'll get R different subgroups, and within each subgroup what I can look at are the non-zero elements, and those non-zero elements let me better understand which clinical characteristics are associated with this subgroup. Then, if I look at the patient dimension, I can understand which patients actually belong to the subgroup.
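To make that interpretation concrete, here is the standard rank-R CP decomposition of a third-order tensor written out in the usual notation (a generic textbook formulation, not the exact model of any one paper mentioned in the talk):

    \mathcal{X} \;\approx\; \sum_{r=1}^{R} \lambda_r \, \mathbf{a}_r \circ \mathbf{b}_r \circ \mathbf{c}_r,
    \qquad \text{i.e.} \qquad
    x_{ijk} \;\approx\; \sum_{r=1}^{R} \lambda_r \, a_{ir} \, b_{jr} \, c_{kr} .

In the phenotyping setting, a_r is indexed by patients, b_r by diagnoses, and c_r by procedures or medications: the non-zero entries of b_r and c_r define the r-th candidate phenotype, lambda_r gives its overall weight, and a_ir indicates how strongly patient i expresses it.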
[00:04:02] And if you just directly map it out, you can get a phenotype that says, OK, 22 percent of my patient population belongs to this. Then I can rank order along each dimension: for diagnoses I can say hypertension is more important than hypertensive heart disease, and along the procedures or medications I can see that beta blockers are the first thing used to treat these patients, followed by calcium channel blockers — and I was already told that Limestone and Marble have been covered here. The takeaway is that this really allows you to do data-driven disease subtype discovery with minimal supervision, and in modeling the interaction between diagnosis and medication I can actually get very interpretable factors. In other words, rather than having lots and lots of non-zero items — tensors are sparse — this representation gives you the very succinct definition that you see on the slide.

[00:05:13] We had clinicians annotate them, and it turns out that the phenotypes we discover with this data-driven mechanism are in fact generally clinically relevant: roughly speaking, anywhere between 60 and 90 percent are typically annotated as fairly good. Now, just to convince you that I'm not the only one who has popularized this: since 2014 there's been a lot of work on tensor factorization in the context of computational phenotyping, and this slide is an approximation of what's been done to date. What I wanted to highlight, next to the name of each work, is

[00:05:56] the type of tensor factorization that's used. You can see CP is by far the most popular; there are two that are PARAFAC2 — one is Kimis's and one is Ari's — and one Tucker, done recently. The second column is the loss. Are you all familiar with what the loss means? No? Yes?

[00:06:33] Yeah, what does the loss mean? ... Right — it's some proxy for how you want to think about the data. The common one we all love to use is least squares, or the Frobenius norm. Why do we use Gaussian models in general — why does Gaussian make sense? And since you already asked a question, I will pick on someone else.

[00:07:18] [Audience response.] Right — a lot of things, when there are enough samples, turn out to be roughly Gaussian. But there's something else. Any thoughts? ... You're very close. What I wanted to highlight is that the Gaussian is a simple distribution with a very nice property: most of the time you have closed-form solutions, and closed-form solutions mean it's computationally efficient, generally speaking. So what I wanted to highlight about the loss is that even when thinking about scaling — how well can you scale tensors to, say, closer to 400,000 patients — you'll notice that all of the works focused on scaling choose the Frobenius norm, because the Frobenius norm has a nice closed-form solution.

[00:08:35] The key thing to note is that if you think about where the Frobenius norm comes from, it comes from the Gaussian, and that means you're really modeling numeric data — and numeric data, as I'll show you later, may not always be the best approximation for everything. The next thing I wanted to highlight is that there have been two works focused on expert guidance — in other words, if I take an unsupervised system, how can I bring in knowledge from domain experts? The first is Rubik, where, if I pre-specify that I want certain phenotypes to represent certain things, I have to tell the algorithm that ahead of time.

[00:09:17] The second work is supervised: they model a classification task, and that means you actually need labels. They incorporate a logistic regression into the loss function, and that means I need to predefine it — some expert has to go and tell my algorithm what it should find. The downside is that you can imagine I might bias my algorithm towards such results; rather than being data driven, I might actually be more knowledge driven, depending on how important that knowledge is.

[00:09:56] The next thing I wanted to highlight is multi-site: there has only been one work looking at multiple sites, and it doesn't scale very well. This falls under federated computation, which I'll cover more later. The other thing to think about is how to incorporate constraints: you'll notice that in almost all of these, the plus means non-negative.

[00:10:21] In this one you're really looking at similarity — I'm trying to cluster the factors I find so that they group similar items together, and that requires some notion of concept similarity, for example from a word2vec model. So this sets the landscape for the work I'm going to introduce today. The idea is: can we develop models that incorporate knowledge without biasing the results? There's a lot of expert knowledge out there — how can I take advantage of it without necessarily biasing my results ahead of time? The second point is that we want to support a wide range of data types: not just the Frobenius norm but other loss functions, so not just numeric data but binary data, count data, exponential data.

[00:11:19] We also want to scale to lots of samples, beyond the previous paradigm, and we want to work in a distributed-data setting. So first I'm going to detour away from tensors for a few minutes and talk about one challenge with unsupervised learning: how do you actually assess computational phenotypes? Typically, the way we've done it is that you have all this electronic health record data, you pass it through some phenotyping process — whether it's tensor factorization, deep learning, or

[00:11:55] whatever — and you generate these candidate phenotypes. The downside is that they haven't been verified to be clinically meaningful, so typically we send them to a panel of experts who verify whether they make sense or not, and at the end of this process we know whether or not they've been verified.
[00:12:19] Now, what's the downside of a panel of experts? Any thoughts? ... They're biased — that's correct; they only know what they know, which is a great point. Other thoughts? ... Labor costs: experts are expensive. If you get five medical doctors for an hour, you're losing about a thousand dollars. I'm looking for one more point.

[00:12:57] Do you and your friend always agree on the same thing? Sometimes, right? But more often than not, probably not — that's why maybe they're just your friend and not your best friend. So not only are you biasing the assessment — experts may be biased towards whether they think something is meaningful — there are also disagreements, and it's time intensive. So the question is: can we rethink this process? Is there some other mechanism that lets us assess phenotypes without incurring this disagreement or this time-intensive cost?

[00:13:39] So what we introduced is the Phenotype Instance Verification and Evaluation Tool, or PIVET. The main idea behind PIVET is that in PubMed there are 27 million scientific articles — that's a lot of knowledge out there that you can readily crawl.

[00:14:01] The goal is, rather than having a physician or domain expert assess a phenotype as the first line of defense, to use a proxy: we can look at whether there's evidence for it in the scientific literature. So we developed a pipeline for gathering this evidence. It sounds easy in concept; it's actually quite difficult in practice. There are three parts to the pipeline. The first is generating synonyms to get better recall: in one journal I might refer to something as a heart attack, in another journal it's an acute myocardial infarction — how can I retrieve both articles and recognize that they mean the same thing? Then we calculate the strength of an association using co-occurrence analysis: in other words, if two things occur in the same article, they're more likely to be related.

[00:15:11] So the first step is synonym generation, and I'm going to concentrate on a phenotypic representation — you can imagine this is a single factor and its non-zero elements from a tensor, but it's basically any set of clinical characteristics you're interested in evaluating. The goal of what we built is a concept expander: let's say I give it the word hypertension; it queries a database behind PubMed that lets us find similar synonyms that might occur in different articles. You'll notice that at the top most of the results contain the word hypertension, but down here you actually have pre-eclampsia and some other really complicated term that I'm not going to try to pronounce.

[00:16:02] In other words, I'm generating a set of synonyms that I might want to include in the problem, and then we query all 27 million articles to figure out which ones actually make the most sense in the context of what we're trying to do. So we'll rank "hypertension, malignant" as something pretty close, and you'll notice that pre-eclampsia ends up a bit lower.
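As a rough illustration of the tree-based synonym expansion just described (the speaker later mentions that synonyms are taken from nearby locations in the MeSH hierarchy), here is a minimal, hypothetical sketch. The tree codes, terms, and counts are made up for illustration; the real PIVET pipeline queries PubMed/MeSH directly rather than an in-memory dictionary.

    # Hypothetical sketch: expand a clinical term into candidate synonyms that share
    # a prefix in a MeSH-like tree code, then rank candidates by corpus co-occurrence.
    # The tree codes and counts below are illustrative stand-ins, not real MeSH data.

    TREE = {
        "hypertension":            "C14.907.489",
        "hypertension, malignant": "C14.907.489.480",
        "hypertension, renal":     "C14.907.489.631",
        "pre-eclampsia":           "C13.703.395.249",
    }

    def candidate_synonyms(term, tree=TREE, depth=3):
        """Return terms whose tree code shares the first `depth` levels with `term`."""
        prefix = ".".join(tree[term].split(".")[:depth])
        return [t for t, code in tree.items() if code.startswith(prefix) and t != term]

    def rank_by_cooccurrence(term, candidates, doc_count):
        """Rank candidates by how often they co-occur with `term` in a corpus.
        `doc_count(a, b)` would be backed by a PubMed index in the real pipeline."""
        return sorted(candidates, key=lambda c: doc_count(term, c), reverse=True)

    if __name__ == "__main__":
        # Toy co-occurrence counts standing in for a 27-million-article index.
        toy_counts = {("hypertension", "hypertension, malignant"): 950,
                      ("hypertension", "hypertension, renal"): 700}
        syns = candidate_synonyms("hypertension")
        ranked = rank_by_cooccurrence("hypertension", syns,
                                      lambda a, b: toy_counts.get((a, b), 0))
        print(ranked)  # most plausible synonyms first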
[00:16:30] Then, once you have these synonyms, we analyze the 27-million-article corpus, and what we use is lift. For those of you who are unfamiliar with lift, the idea is to determine how much more often the intersection occurs compared to each item individually. In other words, the words hypertension and diabetes potentially occur in the corpus a lot; you divide that by the rate at which each occurs individually. So if hypertension and diabetes constantly occur together,

[00:17:10] but individually I would have expected them to co-occur at some specific rate if they were independent, and they co-occur a lot more often than I expect, the lift will be high. The goal is then to generate a single score for the set of items you give it — a set of clinical characteristics — from the lift of all the co-occurrences, and at the end we have a classifier that determines whether the set is likely to be clinically significant.

[00:17:42] We gathered 45 phenotypes from different papers and also from the annotations we've had, and 7 of them are annotated as not significant. We derive different features from lift, and what we discovered is that a logistic regression model gives an F1 score of 0.87, k-nearest neighbors gives 0.9, but you can see the AUC of logistic regression is slightly better, so logistic regression is the model we stick with. This means our pipeline actually maps quite well, generally speaking, to what an expert finds clinically meaningful.

[00:18:23] Now, the question might be how you can use it. What I'm showing you here are two annotations from two different candidate phenotypes that were marked as unsure — in other words, the set of annotators either disagreed or weren't sure whether something was clinically meaningful.

[00:18:44] What you notice are the diagnoses and the medications in that group; the comment from the annotator was that it looked like an arrhythmic heart patient, but they still weren't entirely certain. Now, if you run it through our PIVET score, what you see is that we actually think it's likely to be clinically meaningful. So in situations where the annotator may not be sure, we can gather evidence that might sway them one way or another, using scientific articles that have been peer reviewed. The second one is one they were even more unsure about — maybe it was lung disease, but you see that question mark at the end — and if you run it through our scoring mechanism, the lift is significantly lower, and our tool is also unsure whether it's meaningful. So this gives us a mechanism to fine-tune what the experts annotate, or even to serve as a proxy for the experts, at least initially. What we created is a data-driven mechanism for assessing phenotypes; it's comparable to human annotators in terms of discrimination, and we built it so that it provides relatively "real-time" assessment of individual or batched phenotypes. You'll notice I put real time in quotation marks: roughly speaking, it takes about an hour and a half to annotate a phenotype, but remember that it's crawling 27 million articles, so realistically it's going to take some time to run.

[00:20:28] Any questions about this part? Hopefully it makes sense — this is the only non-tensor part. Yes, you had a question? [Audience question, partially inaudible, about how the synonyms are generated.] We originally tried word2vec and looked at context, and it actually failed quite miserably,

[00:21:04] because the issue is that a lot of diseases are multi-word terms. So what we decided, as a first pass to prototype it, was to use the MeSH database. MeSH has all these defined terms and it actually gives you an ontology — it draws a tree — and the synonyms are basically defined as similar locations within the same tree; that's how we ended up crawling it. What you'll notice is that they share very similar numeric structures: 907.489 is one grouping, and then 381 is another.

[00:21:41] [Audience question about whether word embeddings could be used instead.] That's correct — a student of mine has been working on joint representations, and we've found a much better one, so the goal is to swap this out for a version using word embeddings, but we're not at that stage yet.

[00:22:05] [Audience question, partially inaudible, about interpreting the lift score.] That's right. I'm showing you the lift score just to show you that if we calculate the lift, this is what it looks like, but we can actually compute statistics from the lift and have the classifier output a number between zero and one. The issue with lift is that you have to compare it against other values, and that was the downside of our original work, so we put a classifier on top; now it puts out a score between zero and one.

[00:22:52] These are all practical things we discovered along the way. Any other questions? OK, so now we have this basis for assessing candidate phenotypes. The question is: can we use it as a way to get knowledge into our tensor factorization algorithms?
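Before moving on to that, a minimal sketch of the lift computation described above; the toy corpus and counting function are stand-ins for the PubMed index that PIVET actually crawls, and the downstream classifier features are omitted here.

    # Minimal sketch of lift over a document corpus.
    # lift(A, B) = P(A and B co-occur) / (P(A) * P(B)); values > 1 mean the pair
    # co-occurs more often than expected under independence.

    def lift(docs, terms_a, terms_b):
        """`docs` is an iterable of sets of normalized terms (one set per article);
        terms_a / terms_b are the synonym sets for the two clinical concepts."""
        docs = list(docs)
        n = len(docs)
        has_a = sum(1 for d in docs if d & terms_a)
        has_b = sum(1 for d in docs if d & terms_b)
        both  = sum(1 for d in docs if d & terms_a and d & terms_b)
        if has_a == 0 or has_b == 0:
            return 0.0
        return (both / n) / ((has_a / n) * (has_b / n))

    # Toy corpus of three "articles".
    corpus = [{"hypertension", "diabetes"},
              {"hypertension", "beta blocker"},
              {"asthma"}]
    print(lift(corpus, {"hypertension"}, {"beta blocker"}))  # 1.5: more than chance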
[00:23:15] So what you might think of as potential knowledge is things like: if you have type 1 diabetes, you should not also have type 2 diabetes. That's a statement you would find in a scientific article — maybe not in quite that form, but paraphrased into layman's language it becomes: if you take medication A, you should not take medication B,

[00:23:38] and if you are pregnant, you should not take medication C. You see all sorts of knowledge of this kind in the literature. So the question is how I might incorporate it. One option is to take this existing knowledge, in this form, add it into the tensor factorization model up front, and then what you get is maybe something even better than the original.

[00:24:08] Now, the downside of a model where you pre-define it is that I'm potentially going to bias my model towards the expert knowledge I feed it. I'm assuming that the knowledge is well established, and I don't know about you, but scientific articles constantly have omissions, and there are lots of problems with scientific articles, so maybe we shouldn't say that everything in the published literature is true.

[00:24:38] It also requires lots of annotations: some person has to come along and mark that these two items are related and these two are not. You might be able to do it by running something like word2vec to get concept similarities, but it takes some work.

[00:24:57] So the idea we had was: we built this nice pipeline for evaluating phenotypes — can we use it to improve our tensor factorization models, but not necessarily at the beginning? The idea is to figure out pairs of factor elements that we might want to prune — pairs we might not want to exist together — check what the relationship is using PIVET and PubMed to find its strength, and once we have the strength, update the tensor factorization model to reflect how certain we are that these two things should or should not co-occur.

[00:25:41] So the idea behind CP-CLIC, which is CP tensor decomposition with cannot-link intermode constraints, is that we take our original model, with its rank-one components, and rather than trying to assess the strength of every single pair of items, we look at it from the perspective of the factors: the elements with high values — in other words, high probabilities — are probably the ones that are genuinely prevalent in the data.

[00:26:20] If you see something like 0.9 as the probability that a factor element is turned on, you probably don't want to prune it, because the data is really driving you towards modeling that aspect. The candidates for pruning are instead the low-probability elements — the ones the model isn't really sure about, that maybe should be there and maybe shouldn't. So we take all the low-probability elements, form the pairs, and run those pairings through PIVET.

[00:26:58] And the idea is that we generate this cannot-link constraint: if there are low-probability pairs, I don't want them to be linked together — I don't want them both to be turned on.
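As a rough illustration of that candidate-selection step, here is one way to pull low-probability element pairs out of two factor matrices. The threshold, shapes, and structure are illustrative assumptions, not the exact CP-CLIC procedure.

    import numpy as np

    def candidate_cannot_link_pairs(B, C, threshold=0.1):
        """For each rank-one component r, return (component, diagnosis j, medication k)
        triples whose factor values are both nonzero but small, i.e. low-probability
        elements that are candidates for a cannot-link constraint (to be checked
        against the literature before being imposed)."""
        pairs = []
        for r in range(B.shape[1]):
            low_j = np.where((B[:, r] > 0) & (B[:, r] < threshold))[0]
            low_k = np.where((C[:, r] > 0) & (C[:, r] < threshold))[0]
            pairs.extend((r, j, k) for j in low_j for k in low_k)
        return pairs

    # Toy factor matrices: 4 diagnoses x 2 components, 3 medications x 2 components.
    B = np.array([[0.9, 0.0], [0.05, 0.8], [0.0, 0.02], [0.03, 0.0]])
    C = np.array([[0.7, 0.04], [0.02, 0.9], [0.0, 0.0]])
    print(candidate_cannot_link_pairs(B, C))  # candidate pairs, per component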
And then, rather than just zeroing them out at that point, we analyze the strength, and if we find the strength is actually quite strong, we say that this cannot-link relationship should not be imposed.

[00:27:23] In other words, if we find that some medication and some disease really do have a relationship, even though they're low probability, we'll keep them, because it makes sense — there's evidence to support it. So the idea is to use a cannot-link matrix to reduce the low-probability elements and, in some sense, clean up the results a bit,

[00:27:47] and to use PIVET to accept or reject cannot-link constraints. Comfortable with the basic idea? We're only accepting or rejecting things that we think have low probability, which, generally speaking, won't bias the model too much. The way we formulate it is as one big optimization, and I'll walk you through the important parts. The first is expanding beyond just the Frobenius norm, so we adopted the Bregman divergence. The Bregman divergence encompasses a large class of data types; the four we really wanted to concentrate on were numeric, which is the Gaussian formulation, Boolean,

[00:28:37] count data, and exponential. The second part is the cannot-link constraints: we introduced a new formulation that allows cannot-link constraints not only within the same mode — I can include information like "diagnosis A should not be with diagnosis B" — but also across modes, like "diagnosis A should not be with medication B."

[00:29:06] We also have what are known as angular constraints. The idea is that if I discover something interesting, I want it to have a lower probability of occurring in another phenotype, but without forcing the factors to be orthogonal, which might be too restrictive — you can just specify the percentage of overlap you're willing to allow. So we ran CP-CLIC on a synthetic dataset, and what I'm showing you is what happens when we generate data that is real-valued, count, or exponential.

[00:29:42] If you run the standard CP-ALS algorithm, which is what everybody uses, it assumes a Frobenius norm; the score here is the reconstruction, where closer to one is better and closer to zero is really bad. What you notice is that for real-valued data, CP-ALS is very good at reconstructing the original data. But once you move to different forms of data, the performance suffers — it's no longer able to recover the original known factors.

[00:30:20] Now, if you run it with the Bregman divergence associated with the type of data you have — real-valued data with a Gaussian loss, count data with a count loss, exponential data with an exponential loss — you'll see that

[00:30:37] even with no knowledge, it's actually pretty decent at recovering the original factors, and if I get the knowledge even slightly right — if I tell it which things should not be linked together — it actually does the best. So what you should take away is that it's really important to model data with the appropriate loss. We always use the Frobenius norm because it's convenient, but when you're modeling data you might want to take the extra time to use an appropriate loss;

[00:31:13] otherwise, even on synthetic data, you're not doing very well at recovery. [Audience question about the optimization approach.] We used stochastic gradient descent, because it's actually the easiest, and that's why you see it a little bit lower. Comfortable?

[00:31:55] Then we wanted to see whether we can use side information to get better phenotypes that let us do better prediction down the road. We evaluated on 1,622 patients, with about 1,325 diagnoses and 148 different types of medications. Of the 1,622 patients, 304 had been identified as cases and 399 as controls, and this was done by a panel of domain experts. We take the patient factors and pass them to a logistic regression regularized with lasso, and we learn the lasso parameters using 5-fold cross-validation (a minimal sketch of this evaluation protocol appears a bit further below).

[00:32:34] We ran a bunch of baseline comparison models. With Rubik, which enforces orthogonality, the AUC is 0.5398 — you're paying for the least-squares loss and also for the orthogonality. The supervised baseline, which encodes the label directly into the optimization algorithm — in some sense you can think of this as cheating, because I know the labels ahead of time and I'm incorporating them to find distinctive, discriminative phenotypes — gets 0.6466.

[00:33:19] With Granite, which uses KL divergence — modeling count data and having just the angular constraint — you can see we're already doing a little better than even the supervised baseline, which uses least squares. So even supervision with the wrong data type still does not help you much.

[00:33:42] What I'm showing you are two different versions of CP-CLIC. One-shot means I build and prune the cannot-link matrix all at once, specifying it ahead of time using all the knowledge I have — in other words, I ran all of the diagnosis and medication pairs and built all of those constraints, which is very expensive.

[00:34:03] CP-CLIC gradual is the idea that I slowly accept and reject constraints based on the low-probability elements. What you can see is that I do much better by encoding some domain knowledge, and in the situation where the knowledge is good and I'm willing to pay the computational price up front of encoding the cannot-link matrix, you can use the one-shot version.

[00:34:31] This is just to give you an idea of the results you get back. On the right is the Granite phenotype, which is the second-best baseline, and on the left is CP-CLIC, which used to be called PIVETed Granite. What do you notice — other than a lot of words that maybe don't mean much to you — quantitatively?

[00:35:17] That's right: the left side is much cleaner, because on the right there's a bunch of stuff that is actually just low-probability elements. In other words, CP-CLIC removes the elements that have no real relationship and are really only there as a function of noise.

[00:35:43] So CP-CLIC is much more focused, and in fact we polled two domain experts and they said it aligns with the definition of a typical cardiovascular patient, whereas on the right there's just too much extra stuff in the phenotype.
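For reference, here is a minimal sketch of the downstream evaluation protocol mentioned a moment ago (patient factors fed to a lasso-regularized logistic regression with 5-fold cross-validation). The data, dimensions, and hyperparameter grid below are synthetic stand-ins, not the study's actual setup.

    import numpy as np
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import GridSearchCV

    rng = np.random.default_rng(0)
    X = rng.random((703, 30))          # patient factor matrix: 703 patients x rank 30
    y = rng.integers(0, 2, size=703)   # case/control labels from the expert panel

    # L1 (lasso) penalty; regularization strength C chosen by 5-fold cross-validation.
    clf = GridSearchCV(
        LogisticRegression(penalty="l1", solver="liblinear", max_iter=1000),
        param_grid={"C": [0.01, 0.1, 1.0, 10.0]},
        cv=5, scoring="roc_auc",
    )
    clf.fit(X, y)
    print("cross-validated AUC:", clf.best_score_)  # on real factors this is the reported metric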
We also had them annotate the different versions. What I want to showcase: not significant is the really light blue, then possibly significant, then significant. What you sacrifice with Granite is that you get a lot of significant but also a lot of not significant; if you just model the low-probability elements and try to make them more sparse, you pay more of a performance hit; and with either of the two CP-CLIC versions, the not-significant portion decreases significantly.

[00:36:39] So this gives us a mechanism to incorporate unsupervised guidance into the process — I say unsupervised because there is really no human annotator telling me "this is the knowledge I want to encode, this is what I want to put into the algorithm."

[00:36:58] The information can be both within and between modes, which typically was not modeled before, and there is also the potential to extract more relevant phenotypes, as evidenced by this. Any questions about this one? Yes? [Audience question about how the gradual version works.] So the idea — let me go back a few slides — the difference from one-shot is that in the gradual version I slowly build this cannot-link matrix and then prune: each time I set a new element to one, I go and check it. You can imagine that takes more time. If I wanted to just precompute the whole cannot-link matrix, I can do that, and the idea is that anything for which we don't find a significant relationship in the scientific articles is encoded as a one, and any time there is strength it's encoded as a zero — so it actually becomes a very dense matrix. That's the trade-off.

[00:38:13] [Audience question, partially inaudible, about why the supervised baseline performs worse.]

[00:38:46] I don't have the exact numbers, but in the runs we did before, CP-APR was actually between Granite and CP-CLIC.

[00:39:12] In other words, what we discovered in the process was that modeling the count data provides better, more informative phenotypes. I can talk a bit more about this, but it turns out that even in their original paper, the boost in AUC you get from the supervision is typically very little — like one percent.

[00:39:32] Even in our results — it's in the next set of results, I think — it's very minor. It depends on the type of task: their original paper was on MIMIC-III mortality, and their boost, I think, was something like 0.01 in AUC over CP-APR for one of those tasks.

[00:39:54] [Audience question, partially inaudible, about how the significance threshold is set.]
So we set it — you could set it to be a bit more aggressive; it really depends on how many queries you want to run against PIVET, because each time you update the cannot-link matrix you have to poll for the relationship. It's a trade-off between computation and how much knowledge you want. I forget exactly what we set it to, but it was relatively low — below 0.05, if I recall.

[00:40:43] So, comfortable with CP-CLIC? The next thing I want to talk about is that if you ever decide to work with tensors — maybe I'm not selling tensors well here — they're actually quite expensive. I had to include the XKCD comic: basically, if you go run a tensor model, sometimes you wait 20 or 30 minutes if you're lucky.

[00:41:09] In fact, I'm going to pick on Ari — and Kimis is here — so we actually looked at their paper. This is the COPA paper; their results use 250 gigabytes of RAM with 10 cores, and I think the number is something like 20 minutes, roughly speaking. And I'm going to pick on Kimis, who had one terabyte of RAM two years ago at KDD. So if you really think about it, running tensor factorization models is computationally expensive.

[00:41:45] I don't have the privilege of one terabyte of RAM, or even 200 gigabytes of RAM, at my disposal, so the question was: can we do a little better? Can we lower the memory, can we distribute the data? I'm going to focus purely on CP decomposition, because there's a whole lot of other work out there — there's an asterisk here because I don't cover everything; if I did, it would be really hard.

[00:42:16] So this is looking at distributed CP decomposition and the work that's been done on it. Similar to before, you're looking at the loss function and the constraints. For the loss, there are only two that even claim to think about KL divergence, and in fact if you look at the results of FlexiFaCT, there's never any hint that they ran experiments with it; ParCube actually has it. KL divergence is really hard to parallelize — that's really the punch line: once you go beyond numeric data, it becomes much harder to think broadly.

[00:42:57] If you look at the constraints, you really only see non-negativity and the 2-norm, which is also computationally efficient — so very few support constraints. If you look at the algorithms, there are only two examples of SGD, so SGD hasn't really been fully explored in the context of scaling tensors. We also wanted to think about platforms: some of them are in MapReduce, and two are predominantly in MATLAB, which I wouldn't call very distribution friendly, and Spark is really much better for iterative applications. So the idea was: could we build a Spark implementation using SGD that works well for a wide range of data types, in particular count data? So this is the work SGranite, which stands for Spark Granite, a distributed tensor decomposition for large-scale health analytics. This is the algorithm we focused on, and I want to highlight that we did only KL divergence for this, because a lot of work has already been done on least squares and the Frobenius norm. We also wanted to explore the impact of constraints — whether we can incorporate sparsity, diversity, and also discrimination in the form of a logistic regression.

[00:44:33] The idea behind SGranite is fairly simple. Stochastic gradient descent, by definition, is sequential — you're all hopefully familiar with it: you take one sample, you compute the gradient, you update your factors according to that gradient, and then you iterate.

[00:44:56] So how can we parallelize it? What you can do is divide your tensor into what are known as strata, where each stratum does not overlap with the others. In this case, block one is at the front and block two is a distinct block at the back, and the key thing to note is that when you do it this way, there is no overlap between the factor rows you're updating: the patient factors will be distinct, the diagnosis factors will be distinct, and the medication factors will be distinct. So in this way I can easily parallelize SGD using this notion — in fact, FlexiFaCT uses the same idea for least squares (a small sketch of this stratification idea appears below).

[00:45:45] So you run the blocks in parallel, and we put it on Spark. Comfortable? And you can derive the gradients for any of the regularizers we were thinking about, so as long as you can calculate a gradient, you're OK.

[00:46:07] What I want to show you are two different datasets. On the left is influenza: if you look at search results for flu, Google published data so you can see how often flu searches occurred and the likely prevalence. What this shows is CP-APR, a centralized algorithm for count data, in green; FlexiFaCT is the version you distribute across different machines, and we actually ported FlexiFaCT to Spark so it would be a fair comparison. This is showing you, as a function of the epochs you run, what you end up with.

[00:46:50] Granite is the one with the angular and simplex constraints, and if you look at SGranite, we converge much faster, in fewer epochs, and this holds across a range of machine counts. The two that are distributed are FlexiFaCT and SGranite.

[00:47:12] If you look at MIMIC-III, we couldn't get CP-APR to run — it failed — so what you see is the comparison between FlexiFaCT, Granite, and SGranite, and the same pattern appears there as well: we can easily parallelize, in much less time, using an appropriate algorithm for count data.
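A minimal, single-machine sketch of the stratification idea described above, with a Poisson/KL-style update for count data. The real SGranite implementation runs on Spark and differs in many details (step sizes, constraints, block scheduling), so treat every name and number here as illustrative only.

    import numpy as np

    def strata(P):
        """Each stratum is a set of P block triples that share no block index in any
        mode, so in a distributed setting the P blocks can be processed in parallel
        without two workers touching the same factor rows."""
        for a in range(P):
            for b in range(P):
                yield [(p, (p + a) % P, (p + b) % P) for p in range(P)]

    def block_of(idx, dim, P):
        # Map a global index to one of P contiguous blocks along that mode.
        return min(idx * P // dim, P - 1)

    def sgd_epoch_kl(X, A, B, C, P=2, lr=0.01):
        """One epoch of stratified SGD over the nonzeros of a sparse count tensor X
        (a dict {(i, j, k): count}), using the gradient of the Poisson/KL objective
        sum_ijk ( m_ijk - x_ijk * log m_ijk ) with m_ijk = sum_r A[i,r] B[j,r] C[k,r]."""
        I, J, K = A.shape[0], B.shape[0], C.shape[0]
        for stratum in strata(P):
            blocks = set(stratum)
            # In a distributed run these P blocks would go to P workers; here we
            # simply loop over the entries that fall inside the current stratum.
            for (i, j, k), x in X.items():
                if (block_of(i, I, P), block_of(j, J, P), block_of(k, K, P)) not in blocks:
                    continue
                m = max(A[i] @ (B[j] * C[k]), 1e-12)
                coeff = 1.0 - x / m                                   # d/dm of (m - x log m)
                A[i] = np.maximum(A[i] - lr * coeff * B[j] * C[k], 0.0)  # nonnegative projection
                B[j] = np.maximum(B[j] - lr * coeff * A[i] * C[k], 0.0)
                C[k] = np.maximum(C[k] - lr * coeff * A[i] * B[j], 0.0)

    # Toy example: a 6 x 6 x 6 count tensor, rank 3.
    rng = np.random.default_rng(0)
    X = {(int(i), int(j), int(k)): int(rng.poisson(2)) + 1
         for i, j, k in zip(rng.integers(0, 6, 40), rng.integers(0, 6, 40), rng.integers(0, 6, 40))}
    A, B, C = rng.random((6, 3)), rng.random((6, 3)), rng.random((6, 3))
    for _ in range(20):
        sgd_epoch_kl(X, A, B, C)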
[00:47:41] So SGranite achieves lower KL divergence in fewer epochs, which is a great thing. Now the question is: fine, it's faster, but are the factors actually better? What I'm showing you for the influenza data is what FlexiFaCT gets, and what I want to highlight is that it's actually not that easy to interpret. We use a small rank, and what we would guess is that there seems to be some peak at week 20 in influenza searches,

[00:48:19] and you can't really tell the different regions apart — it's also unclear. Now, if you add smoothness, sparsity, and diversity, these are the results you get using our algorithm,

[00:48:47] and what you'll notice is that some regions are on for factor 2, some regions are off for factor 3, and so on and so forth, which you don't see with FlexiFaCT. The other thing we wanted to highlight is when the factor is turned on: you'll notice FlexiFaCT's is turned on at week 20, while at week 20 ours is actually turned off, so the two algorithms are finding something completely different. So which one is right? If you look at the CDC flu counts, you'll see that they start to rise right around week 50 and by week 10 they're starting to go down. In other words, the factors discovered from search data using SGranite are more consistent with what has actually been published and observed by the CDC; there's no awkward peak at week 20 — at week 20 there's almost nothing there.

[00:49:45] So FlexiFaCT is effectively learning noise, whereas if you give the model sparsity and diversity, it's easier to learn something useful. We also looked at predicting mortality, so this gives you an idea of the AUC, the time it takes — for four workers, I believe — and the average overlap between factors. You can see CP-APR gets an overlap of 0.63, and if you turn on discrimination and the angular constraints you get all the way down here. If you look at the time it takes: with the angular constraint off it's a bit faster; with it on, it's slightly slower, but you also get better — that is, lower — overlap, which is a great thing. Yes?

[00:50:36] [Audience question about the size of the tensor.] It's 38,000 by, I think, 205 by 200, something along those lines — exactly what it is, I forget. So that means SGranite finds discriminative and distinct phenotypes in less time than FlexiFaCT — in other words, you can incorporate constraints in a distributed setting and still get very good results.

[00:51:02] [Audience question about the implementations.] Right — CP-APR is not in Spark; CP-APR is just single-node Python. FlexiFaCT and SGranite are both in PySpark, so that's a fairer comparison.

[00:51:30] And if you look at ideal speedup, you'll see that MIMIC is actually pretty close to the ideal speedup, while the influenza tensor is much smaller, so a lot more is incurred in communication — distribution really makes sense when you have large tensors, considering how much time it takes just to start up Spark.
[00:51:57] So this scales to large tensors using block partitioning and parallel processing; you achieve faster results without sacrificing the quality of the decomposition, and you can actually handle multiple types of constraints with this model. Question? [Audience question about memory usage.] So FlexiFaCT doesn't have much of a memory blow-up,

[00:52:32] because it's on the order of the size of the sparse tensor, and the issue is whether you can form the Khatri-Rao product. In converting to stochastic gradient descent, the nice thing is you don't have to form that product, so it's actually a bit more efficient. This was run on relatively small nodes — I think they're 4.xlarge instances or something like that — so they're not super big: maybe 16 gigabytes of RAM at most on each worker node.

[00:53:06] Any other questions about this? Now I'm going to switch to the last set of work, which is about how data is distributed. You can imagine that you have data everywhere — for those of you who have been to hospitals or clinics, you probably don't visit the same place every time; you get referred around, and depending on where you go, you might visit somewhere different.

[00:53:34] The question is: can I build better phenotypes by taking into account all the data that's available? Because for tensor factorization, the more data you have, the better your results. Now, there are some challenges with multiple locations. The first is security and privacy:

[00:54:00] the data cannot be centralized, because you want to protect patient information — there are all these rules like HIPAA, and they really don't want you to aggregate all the data in one place. The next thing to think about is the prevalence of diseases, or whether patients at different sites look the same.

[00:54:27] What this slide shows is the disease prevalence of diabetes, and what you'll notice is that here in Georgia we're not doing very well — the Deep South is terrible for diabetes, very bad — and the question is whether we should model a Georgian the same as a Californian, or even

[00:54:51] a Texan or a New Mexican. So there really are different disease patterns depending on the location you're in. Two years ago there was a model called TRIP, which was a federated tensor factorization,

[00:55:29] and the key thing to note is that if we take all the data from the different sites and put it together, that gives a global tensor, and I can think of the local tensors as horizontal partitions of the global tensor. In other words, I can make some approximation for how they should all fit together in the same place. The other thing to think about is that when I distribute the computation locally, I want to synchronize the factors between the different sites — I want the model to learn across multiple sites and get similar results. I don't want each site to learn its own thing, because then I'm not pooling information.

[00:56:00] So what happens is you have a parameter server: you send the local factor matrices to the server, the server aggregates them, and it sends back what it thinks is the consensus so that each site can update its factors. Now, there are two downsides to this work.
[00:56:17] The first is potential patient information leakage: even though you're not explicitly sharing who is in your patient population, you might be leaking information by sending updates of the factor matrices. You can think of this as somewhat analogous to the Netflix challenge: Netflix was supposed to release a second million-dollar challenge, and the reason they didn't is that there was leakage of information — by cross-referencing with IMDb, you could tell who was in the database and who wasn't — and you don't want such things to happen in health care. The second downside is that by asking all the local feature matrices to look like the global features, I'm assuming a homogeneous population exists across all the different sites — I'm assuming everybody looks the same, when in fact the populations look quite different. So the question is how we can model this heterogeneity. What we developed is DPFact, a privacy-preserving collaborative tensor factorization, and there are two big things that differ from TRIP. The first is that we focused on lower communication cost: we wanted an algorithm that communicates less with the parameter server, because the less communication you do, the less opportunity there is for leakage. The second is that we wanted some notion of privacy guarantees, and there's this whole field of differential privacy, for which I'll give you some quick intuition.

[00:58:00] We basically provide a guarantee for how we can protect patient privacy. And the third thing is modeling a heterogeneous patient population: how can we let different sites have different patterns without violating the global consensus? So there are really three components. The first is that we can locally factorize each tensor without sharing the data across sites. The second is the global consensus, which means every site should share the same features as the other sites. And then, to account for heterogeneity, what we introduce is the L2,1 matrix norm. This is often used in multitask learning; the main idea is that I'm trying to set some columns to zero — in other words, some phenotypes are zeroed out at specific sites, which means some phenotypes will be present at a given site and some will not.

[00:59:15] The way you update is with a proximal gradient descent at each local site, and I share the parameters every K epochs — so rather than sharing every epoch, I explore locally more and then send updates, which gives me lower communication. And then, before I send the factor matrices to the parameter server, I add some Gaussian noise, and adding the Gaussian noise helps me guarantee patient privacy to some extent.

[00:59:49] I wanted to give you an idea of what differential privacy is. The intuition is that you want to make it hard for an adversary to determine whether your data is in the database: if I query the database, whether or not your data is in there should not noticeably affect the result that comes back — if it does, then I can tell whether or not you're in the database.

[01:00:15] And so what we can provide is a guarantee on patient privacy.
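For reference, the standard (epsilon, delta)-differential-privacy definition that this kind of guarantee is phrased in (the usual textbook form, not the paper's exact theorem): a randomized mechanism M is (epsilon, delta)-differentially private if, for any two neighboring datasets D and D' that differ in one patient's record, and for any set of outputs S,

    \Pr\left[\mathcal{M}(D) \in S\right] \;\le\; e^{\varepsilon}\,\Pr\left[\mathcal{M}(D') \in S\right] + \delta .

Smaller epsilon and delta correspond to a stronger guarantee; the Gaussian noise added to the shared factor matrices is what allows the overall release to be analyzed in these terms.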
The guarantee depends on the number of sites and the number of epochs we end up running to reach convergence, and the other two parameters are associated with how much of a privacy budget you have — in other words, how important privacy is to you and how much you can dedicate to it.

[01:00:46] These are parameters you can actually set. The first thing I want to show you is the impact on fit for the MIMIC-III dataset and one CMS dataset. For MIMIC we partitioned the patients based on the ICU they came from — there are six different ICUs — and for CMS we just randomly partitioned. What I'm showing you is that with a centralized algorithm — CP-ALS and SGD — the first thing you'll notice is that CP-ALS hits a local minimum pretty fast; that's the black line. If you switch from least squares to stochastic gradient descent, you jump out of that local minimum and actually get better accuracy, a better fit. TRIP basically oscillates close to CP-ALS, and DPFact shows what you sacrifice in terms of fit if you distribute across multiple sites and add privacy — that's why DPFact has slightly higher RMSE than the others:

[01:02:03] that comes from distributing the data and also from the noise. But the takeaway is that accuracy is better with an SGD-based approach, fewer epochs are needed to converge, and that means less communication. You'll notice that TRIP actually takes much longer to converge in this case than DPFact, and there's going to be some trade-off between accuracy and privacy. If you analyze the two distributed algorithms in terms of communication cost, you'll see that across the board we incur much less communication cost to achieve the same fit or accuracy compared to the existing algorithm.

[01:02:50] And you can actually quantify this privacy-accuracy trade-off. Epsilon is the privacy budget: lower epsilon means I want stronger privacy guarantees, higher epsilon means I'm willing to expose a bit more of my patient information. As you loosen the budget, you do much better in terms of recovering the original — closer to one means no loss incurred from the privacy guarantees — and this end is what you get if you want very stringent privacy. Comfortable?

[01:03:33] And if you quantify the impact not only quantitatively but qualitatively: I'm showing you two phenotypes, one discovered without privacy, on the left, and one discovered with privacy. What you notice is that, generally speaking, the top elements remain the same, with the order slightly flipped — you can see benign essential hypertension, the paroxysmal one, and heart failure are present in all of them —

[01:04:03] and on the left you see right heart cardiac catheterization and attachment of a pedicle or flap graft, while the remaining elements are just noisier. So there is some trade-off, but mostly the prominent items remain the same. And if you quantify it and evaluate mortality prediction on MIMIC-III — CP-ALS is the centralized one, TRIP is the previous federated one, and DPFact is the one with guarantees — what you can see is that at rank 50 we achieve the highest AUC across the board. We also wanted to quantify the importance of the differential privacy piece and of the heterogeneity piece, and what you'll notice is that if we turned off the L2,1 norm, our performance suffered significantly — so forcing every site to share exactly the same parameters was actually a bad thing.

[01:05:18] And you can see that if I turn off differential privacy I do even better than with privacy on — so I am trading some accuracy for privacy. Yes? [Audience question about where the noise is added.] That's right — to guarantee differential privacy, we're adding noise to the parameters that each local site learns.

[01:05:52] [Audience follow-up.] So notice that one of the parameters you're sharing back could reveal that someone has HIV. Let's say I ran this local tensor factorization and one of the factors that's turned on is HIV; in a small county I could easily ascertain who might have it, and in sharing that information I'd be violating privacy, because I don't want anyone to be able to easily tell who's in my patient database. So if there is a diagnosis that is very sensitive — HIV, or others; I've used a prominent example — knowing that fact could jeopardize the patient, from the perspective of employment and whatnot, because of everything that comes along

[01:06:50] with it. [Audience comment.] Right, but the learned parameters are actually the phenotypes, so if I reveal that one site has a specific phenotype, I could learn a lot about their patient population; the Gaussian noise helps preserve that, though obviously I'm then not going to learn the best factors. Does this make sense?

[01:07:15] [Audience question, partially inaudible, suggesting an alternative approach.] That could be one way, that's right, but in this case what we went for was the differentially private version, where I can share information but have some guarantees on what I'm sharing.

[01:07:45] Comfortable? So really, modeling the distinct patterns in the patient population can actually improve discrimination. Yes? [Audience question about malicious sites.] We don't have a mechanism for that; what we assume is that the sites are semi-honest — each site is semi-honest about what it shares — and we assume the adversary is only listening, not actively trying to perturb the system.

[01:08:28] Yeah, this was just our initial setup; there are more things you can think about doing. That's right — that's future work we've thought about, handling that situation, but the semi-honest setting is much easier; I think "semi-honest but curious" is the actual model we go by.

[01:08:58] Comfortable? So I wanted to quickly cover the importance of heterogeneity. What we discovered in the process of doing this is that there are five ICUs related to adults and one related to neonates, and if you try to model them all the same, the neonatal ICU is obviously a very small population and it's quite different. This is showing you the result if you look at the L2,1 norm of the matrix: what it's telling you is that these particular phenotypes are turned off, and only a handful — 10, 23 through 25 — are turned on.

[01:09:38] So only some of the phenotypes are really present in the NICU; that's what this norm buys us. And in fact, if you plot the AUC for the NICU alone, what you see under our model is that the AUC keeps getting better, because we're learning more pertinent factors associated with the NICU as time goes on. As we allow more and more rank, there's more flexibility to learn NICU-specific phenotypes compared to the other ICUs, and that's why you see this gain in AUC.

[01:10:17] So with larger ranks you can really find these important NICU-related phenotypes. And if you look at the actual phenotypes — 25, 29, 30, 34, 35 — and at the logistic regression coefficients you build with them, what you notice is that congenital heart defect and newborn respiratory failure occur quite often; that's the prevalence.

[01:10:47] Then, looking at the importance, you can see that their coefficients are negative, which means they're associated with lower mortality. If you look at anemia and acute kidney injury, they have positive coefficients: they're less prevalent, but they're associated with higher mortality, and if you look at the literature, PubMed supports these findings. So the summary I hope you take away is that you can keep data at multiple sites and discover both global and site-specific phenotypes, and what we have is a mechanism that is communication efficient and comes with a guarantee of patient privacy.

[01:11:35] So what I hope you took away from the talk is that you can take tensor factorization and add all these different components: incorporating information from PIVET, which gives you CP-CLIC; distributing the computation, which gives you SGranite; and making it privacy preserving, which gives you DPFact. And what that gives you is healthier tensors.

[01:12:04] With that, I just want to thank all the collaborators who have done all this work — in fact, I've done very little of it. These are all my collaborators; I'm not going to list them by name, and there are many more for other projects — these are just the ones for the work I've covered today. If you're curious, this is my website if you want to contact me. And I'm done.