[00:00:05] >> All right, good morning everyone. This will be a little bit different; I'm going to talk a little bit more about application here, and some of what we're doing at Georgia Tech, both on the GTRI side, which is the Georgia Tech Research Institute applied research side, and on campus, as it relates to working with observational health data analytics at scale. I think you might hear a little bit more about genomics later from Oak Ridge, I'm not sure about the stuff at Oak Ridge, I know you guys are doing that, but here we're doing a lot around observational health data, and I just wanted to highlight some of the challenges we run into, and where some of the things that happen in CRNCH and some of these novel architectures might be able to come into play and help us with those challenges. So a little bit of background: we are CHAI here, the Center for Health Analytics and Informatics. This is led by Dr. John Duke, who holds a joint appointment with campus. He's an MD but also has a master's in human-computer interaction, so quite a rarity: someone who can code and also diagnose disease. We are a joint center, and our goals are to advance the foundational technologies in health data analytics, to support partners in maximizing the value of their health data resources, and, very important, to facilitate targeted delivery of information. One of the key aspects we run into is how do we create interactions, how do we create insights, that somebody will actually listen to. And then, and this is probably what we all care about here, scalable and secure real-world data research, nationally and globally. So what does it mean to be scalable, whether we have lots of data in one place or lots of data separated across many sites? And there's also an aspect of security we have to care about: we care about HIPAA and privacy, and also the ideas of things like
[00:01:47] de-identifying data and synthetic data generation, which all require high computational resources. So what does it look like, what do we do in CHAI? CHAI's mission is to improve human health through innovation in the capture, analysis, and delivery of health data. There are really four types of things we're trying to do. One of them is the health data aspect. Health data comes in many shapes and forms: we work with a lot of claims data, we work with a lot of [00:02:14] electronic health record data, we work with sensor data, the things coming off your smart watches, the things collecting patient-generated health data. That sometimes also requires streaming analysis, to be able to interact and intervene. For instance, when I was walking over here my watch thought I was doing an outdoor walk and wanted to tell me to record it, right? So we want to be able to make those kinds of interactions at the point the patient cares about, interactions that engage and encourage the behaviors you want. And then of course omics-type data, for the very precise precision-medicine research that is being done, at this point a lot of it GPU-based research for neural nets. So we want to take that health data and process it securely. That's probably one of the biggest challenges: being able to have scalable architectures that are secure and that we can actually get to.
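The watch anecdote above is essentially a sliding-window computation over a sensor stream. Here is a minimal sketch in Python; the window size, threshold, and sample values are made up for illustration and do not reflect any actual CHAI pipeline:

```python
from collections import deque

def detect_activity(samples, window=5, threshold=100.0):
    """Flag sustained elevated heart rate over a sliding window.

    samples: iterable of heart-rate readings (beats per minute).
    Yields (index, mean_bpm) whenever the windowed mean exceeds the
    threshold, i.e. the moment an intervention could fire.
    """
    buf = deque(maxlen=window)
    for i, bpm in enumerate(samples):
        buf.append(bpm)
        if len(buf) == window:
            mean = sum(buf) / window
            if mean > threshold:
                yield i, mean

# Simulated stream: resting heart rate, then a brisk walk begins.
stream = [72, 75, 74, 98, 105, 110, 112, 108, 99, 80]
alerts = list(detect_activity(stream))  # alerts fire at indices 7, 8, 9
```

The real engineering problem is doing this kind of windowed computation continuously, on-device or in situ, for many sensors at once.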
[00:03:07] Sometimes you practically need a writ from the king to get into those rooms where you can actually even work on the data. And then the analytics is very important. We have a lot of partnerships here at Georgia Tech, and also in GTRI, where we're looking at how to develop new and novel machine learning, new and novel analytics, so that we can actually do things like predict disease: can we predict recurrence of disease, can we predict cost? There's a lot of work there. We then want to take this large amount of data and feed it at scale to these analytical methodologies. How can we do that if it doesn't fit in a GPU? How do we do that across multiple nodes? How do we do that if there are disparate sites, in a federated way? For instance, there's a health data network [00:03:53] that we work with called OHDSI, and there are a bunch of clinical sites that all partner with us and say: OK, we have a large health data set, it can't leave this site, but we want to be able to run your analysis here, and the output of that analysis can then be sent back to you. So how can we run those kinds of things in a federated, scalable environment? Then of course the most important part: we can do all this cool stuff, but how do we actually enact change? How do we provide effective decision support such that when you're interacting with your doctor, when your doctors are discussing, when you're making a decision, you make the right one? For instance, about 90 percent of what doctors get told today, whenever they are provided with a decision support system, is ignored. You go to the doctor, they prescribe you a drug of some kind, you take it and leave, they go back to their office, they type it in, the system says "contraindicated, blah blah blah," and they ignore, ignore, ignore: "I know better, let's move on," right? So it's also about providing the ability to
apply these analytics at the point of care, at the time of decision, because [00:05:01] doctors think they know better, and they want to trust that decision. That's the other aspect: explainability. Being able to build scalable architectures that can run these more complex models and provide explainability is very important. And then of course targeted action: what are we supposed to do with it, how do we actually make it happen? So this is our pie chart of why it's hard. I'm going to save "hard to scale" for the end, because I think that's the key one here. But, you know, platforms are fragmented. Interoperability of data is a hard thing: there's varying use of standards. For instance, we work with the DoD and the VA. If you are in the Department of Defense, enlisted in the Army, Navy, whatever, your health care is provided and handled by one electronic health system. When you leave the DoD and go into Veterans Affairs, [00:05:49] it's a different health system. They're completely disparate, so how do we have those two systems talk to each other, and how do we get varying uses of standards to work together? Information overload, interruption of workflow: that's what I was talking about with the doctors and nurses, they just ignore insights, and insights do not equal behavior change. Then we start to get into some of the more challenging analytics aspects: the holes in the data, the disparate data types you have to deal with, do we have enough patients? We need to be able to scale to the patients we have, collecting all that data from multiple sites and working on it. Methods or models that lack interpretability: this gets into some of the advanced research we're doing here at Georgia Tech, and across the country, across the world really, to create more interpretable machine learning models. You might look at DARPA's explainable AI work for that.
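The DoD/VA problem above is, at its core, mapping two coding dialects onto one shared vocabulary. Here is a toy sketch; the concept IDs and the code-to-concept tables are invented for illustration (real harmonization uses standard vocabularies from a common data model, not hand-written dictionaries):

```python
# Hypothetical mapping of codes from two disparate EHR systems onto
# shared concept IDs. The mappings below are illustrative, not the
# actual DoD/VA vocabularies.
SOURCE_MAPS = {
    "system_a": {"250.00": "C_DIABETES_T2", "401.9": "C_HYPERTENSION"},  # ICD-9 style
    "system_b": {"E11.9": "C_DIABETES_T2", "I10": "C_HYPERTENSION"},     # ICD-10 style
}

def harmonize(records):
    """Map (system, code) pairs onto shared concept IDs, flagging unknowns."""
    out = []
    for system, code in records:
        concept = SOURCE_MAPS.get(system, {}).get(code)
        out.append(concept if concept else f"UNMAPPED:{system}:{code}")
    return out

mixed = [("system_a", "250.00"), ("system_b", "E11.9"), ("system_b", "Z99.9")]
concepts = harmonize(mixed)
# -> ["C_DIABETES_T2", "C_DIABETES_T2", "UNMAPPED:system_b:Z99.9"]
```

The two diabetes codes collapse onto one concept, so downstream analytics can treat both systems' patients uniformly; the stray code is surfaced rather than silently dropped, which is where most of the harmonization labor actually goes.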
[00:06:38] How do we make interpretable models that we can actually run and scale? How can we leverage things like, for instance: whenever I am diagnosed with a disease, the doctor writes up a note about what he did. How can I leverage his explanation of his decision making to also help when I'm making the same kind of diagnosis prediction with some sort of machine learning methodology? Results also don't equal insights, and I think that's the biggest challenge out here. [00:07:03] If you're interested in this area, it's not just about "I need to get an AUC." Is an AUC useful by itself, right? I need to get something I can actually do something with, that actually drives at the real problem. So hopefully today I can give you some of the real problems we're trying to solve, where we can try to scale. And of course, hard to scale: this is very important, especially for this crowd, because as you'll see, we work with observational data sets of 115 million patients, with millions and millions of data points covering all the diagnoses and conditions and procedures that occur along the way. So we have four pillars that we work on here; I am sort of the lead on the GTRI side for the analytics side. We work on scalable health data; bringing analytics to scale; interoperability, how can we get these to all work together and actually reach the point of care; and then human factors, how do we actually engage people, and do it in a way where they'll actually care about what you're telling them, they'll listen, and they'll trust you. That's a whole science unto itself, maybe one we ignore a bit too much. I would say maybe human factors is another pillar for CRNCH, right? We're going to make these new architectures, but how am I supposed to use them? Make it so that somebody can actually leverage them at scale.
[00:08:19] So here are some of the areas we work in that we would say could potentially benefit from some of the work being talked about here. In sensors and IoT there's a lot of streaming, high-performance data that needs analytics at the point it's collected, and we also need to handle streaming analytical data as it comes in. There's data integration; there's natural language processing. Nowadays we basically use something built on top of the Lucene index; a Solr-type search of some kind is what everything is built on, on all-flash arrays. What could we do to go to even higher scales than that? We use common data models to get everything into one format that we can work on together. On the analytics side, obviously machine learning in all its flavors, from the traditional and classic to deep learning and everything in between. We do things like computable phenotypes. You might think all we're trying to do is predict disease; well, actually one of the biggest challenges is defining patient groups consistently. Think about drug companies: they have a new drug they want to test, and it's applicable to cancer patients with this particular subtype, who have had it for this long, who presented with this and haven't had that. How do I express that in a way where I can computably find those patients consistently across the entire network? [00:09:38] How do I define that in a computable way? So there's a lot to do with machine learning about finding patients of interest in a consistent way, and that requires highly scalable computation. There are things like: give me a seed set of patients,
a set of patients that have the types of things I care about. These are some good example patients. Now you can think about seed set expansion, you can think about some of the graph algorithms out there that can expand over a knowledge graph to say: yes, these other patients are good, and these other patients are bad. We need to be able to support those kinds of scales, and obviously all of this requires high-performance computing. Now on the other side of the coin, just to give it a fair shake, there are things like FHIR, the Fast Healthcare Interoperability Resources standard. If you're interested in applying these types of technologies, you can use things like FHIR to get data in a standard way across many domains and across many [00:10:36] different, for example, electronic health records; many of the big vendors actually support FHIR apps. So this is a way to integrate and actually use things within a real application. And then I've mentioned the federated research platform. This is about thinking beyond "do I have all the compute in one spot": how do I work if the compute is distributed? How do I work if the sites can't talk to each other, but the results can? What can I do there? And then of course human factors and product evaluation: how do we design the actual rooms, the setups, how we talk to the providers, et cetera. [00:11:17] So like I said, these are the goals of the Center for Health Analytics and Informatics: we want to do scalable and secure work in support of health care partners, and this is where we're working right now. We work with a variety of organizations, ranging from drug companies to government agencies like the CDC and the FDA.
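To make the seed-set expansion idea from earlier in this section concrete: pool the codes of the clinician-approved example patients and rank everyone else by overlap. This is a deliberately naive stand-in for the knowledge-graph algorithms mentioned above; the patient IDs and codes are invented:

```python
def expand_seed_set(patients, seeds, min_overlap=2):
    """Rank non-seed patients by how many codes they share with a seed set.

    patients: dict of patient id -> set of diagnosis/procedure codes.
    seeds: ids of known-good example patients.
    Returns (patient id, overlap) pairs, best candidates first.
    """
    seed_codes = set().union(*(patients[s] for s in seeds))
    scored = [
        (pid, len(codes & seed_codes))
        for pid, codes in patients.items()
        if pid not in seeds
    ]
    return sorted(
        [(pid, n) for pid, n in scored if n >= min_overlap],
        key=lambda pair: -pair[1],
    )

# Invented cohort: p1 and p2 are the clinician-approved examples.
patients = {
    "p1": {"dx:diabetes", "dx:neuropathy", "rx:metformin"},
    "p2": {"dx:diabetes", "rx:metformin", "dx:retinopathy"},
    "p3": {"dx:asthma", "rx:albuterol"},
    "p4": {"dx:diabetes", "rx:metformin"},
}
candidates = expand_seed_set(patients, seeds={"p1", "p2"})  # -> [("p4", 2)]
```

A production version would run graph algorithms over hundreds of millions of patients and a full medical knowledge graph, which is exactly where the scaling pressure comes from.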
As you would expect, we also have partnerships with the health care providers here in the Atlanta area, places like Emory, Google Health, all of these places. These provide us lots and lots of use cases, lots and lots of data sources, and these are the types of partnerships you'll see as you start to look at people working in this observational health data network. [00:11:56] What we've built here is a platform for health data analytics research. It's a transportable platform: for those of you at Georgia Tech, you can actually tap into it and access data here if you're interested, but it can also be stood up elsewhere, and we're standing it up at clinical research sites, et cetera. We have data networks with structured claims and EHR data at scale, like I said, hundreds of millions of patients of records in structured data; tons of clinician notes, which is where we do a lot of our work and where the scalability concern comes in; the sensor and IoT work, where we're building our own sensors, our own smart watches, yoga mats, all these things streaming sensor data in at the same time, which we want to correlate in real time to find some outcome or recommendation; patient-reported data; and omics data. Omics data has definite scale issues, as you would imagine, because everybody has a lot of DNA, and DNA is a lot of data, right? So we take that data network and provide it in a way that you can access it and work together. Now, where does scalability come into the process? If you're familiar with ETL,
you're probably thinking SQL queries, right? I just need to run a bunch of database queries, extract transform load, and now I've got a data frame to work with. Well, this is not all structured data. Extract transform load here also means working with unstructured data: I need to build word vector models, I need to build vector embeddings, I need to build topic models on top of the notes, I need to standardize and harmonize both structured and unstructured data, all of which requires scalability. Phenotyping: you might say, OK, how do I find people that are similar? The heuristic way is I come up with a search query; that's the old way, with rules. The computational way is what I described before: how do I find people of interest and people not of interest, and do some analysis to say here is a group of interest and here is not? How do I look at it in an unsupervised way, [00:13:56] to find groups of interest I couldn't even have defined with a set of rules, to discover that these patients are all very similar on these characteristics and we can study them, when we didn't know about the group before? And then of course validation: how do we present all of this back to the doctors, to validate that we actually built something of value? The analysis, of course, is machine learning, machine learning, machine learning. [00:14:18] Population-level estimation: how can we make population-level decisions, how do we do descriptive statistics and characterization of populations? Individual patient-level prediction: how can I predict the right diagnosis for you, the right drug for you, the right course of treatment for you, based on everything from your genomics
to your past medical history, the clinician notes, everything we have about you? How do we bring all of that together and do it at scale? You can see there are some definite challenges there. So at the end, what do we want? We want reports and dashboards; obviously, everybody wants reports and dashboards. We want cohorts: a group of people where we can say, yes, this is a group we can study, they are in fact the same group, and it's computably definable, so that we can be very specific about whom we are studying. That's very important. Obviously the models, and explainable models too, which is very difficult, and I struggle with this all the time. I work with John Duke, like I said, a clinician. He doesn't want a model that tells him what he already knows, but he also doesn't trust me if I give him a model that tells him something he didn't know, right? That's the difficulty. If it tells him something he already knows, he says, well, I learned that in my first six months of med school, why did it tell me that? So we need to accept that it's not just about having a good model. [00:15:45] Apps and APIs: this gets into, when we think about some of these novel architectures, how do we actually leverage them in a way that doesn't require someone to have a PhD
in the subject to work with them, right? How do we expose this capability? Right now, the people who work with and for me use Python, Jupyter, R. They're not computer architects. I have the luxury that I studied computer architecture for a very long time, so I actually understand what's happening under the hood, but not everybody does. So how do they leverage that new technology in a way that enables new results we couldn't get before, without having to, like I said, go to school for ten years to learn how? So I just wanted to highlight, with the time I have remaining, some of the work we've done; I'd be happy to take questions as well. One of the places where we're working is natural language processing for oncology data. This is an open-source platform, if you're interested in an application; it's called Clarity, it's on GitHub, and we have weekly webinars you can join to learn how to use it. This is what Celgene and the FDA are using to do precompetitive work: how do we identify patients of interest across many clinical sites using unstructured data? Again, what's the backbone of something like this? Solr, Lucene, text search, word vectors, word embeddings. There's a bit of everything in there that all has to be pipelined together, and there are bits and pieces that all need to go to scale. [00:17:20] So what are some of the opportunities here? Well, we need the ability to tune our embeddings on large corpora of text. Right now, if it doesn't fit in memory, it's very hard to create word embeddings; you end up having to narrow down or reduce the corpus. If it doesn't fit on the GPU,
it's going to be very slow. And we want to do not just word embeddings; we also care about subword embeddings, so that out-of-dictionary terms are still within our vocabulary. If there are misspellings, or somebody phrases something a little differently, we care a lot about that. We leverage GPUs for some of this, but what else could we use? Scalable NLP tasks like topic modeling: there's been a lot of work here at Georgia Tech in the past using things like scalable non-negative matrix factorization to do large-scale topic modeling, but again, that was an entirely single-system, in-memory kind of thing. How do we scale that out beyond [00:18:16] what we can do today? Because right now we have to fall back on online methods, or things like that, which may not be as good, in order to go over the entire corpus of text that we have. Supervised training for named entity recognition: you can imagine the words inside a clinician note are not necessarily the types of words models get trained on from Wikipedia. There's a lot of lingo that may or may not be there, and there are things you need to do to create better-trained models, and we need to do that at scale. And like I said, computational phenotyping: expanding the rules to find other patients like the target group. [00:18:52] Also interpretability: we need the hardware to support the complexity of the neural nets, the ML
methods, everything at that scale. So we have the Truven MarketScan database, something that basically every drug company buys, with about 115 million unique patients and millions of observations: diagnoses, conditions, measurements that we want to analyze and build on. We want to narrow that down to a phenotype and then build some sort of predictive model, so we need interpretability. We need the traditional kind, how do we design the neural net to maybe make it more interpretable, but also, how do we add interpretability across this entire machine learning pipeline? When someone works in our group, we might have FHIR as an interoperability layer feeding us data, we might pull data from multiple sources and bring it in, and they want to work in a Jupyter-style interactive environment. They don't necessarily want to launch a job on an HPC cluster and come back the next day, or if they do, they'd like it to feel more like a JupyterHub environment. So the question is, how do we make working in this environment more interpretable, and how do we make it easier to work with? Other challenges: things like fighting the p-hacking problem. We do a lot of work in observational health data, and it is not a controlled experiment, right? We are not actually setting aside a control group and a test group, so we have to mimic that when looking at observational data. The problem is, if you set a p-value threshold of 0.05, how many of your experiments will be significant on average, just randomly?
[00:20:30] Five percent, right? So what you do is run 20 experiments, find the one that's significant, and publish it, because nobody will publish the other ones. So what do we do in OHDSI with observational health data? We have new methodologies where we run lots of experiments and adjust for things we know: positive controls, where we know from actual studies that this is correlated with that, and negative controls, where we know this is definitely not correlated with that. We leverage those to shift where the p-value line sits, to make it more stringent. So whereas in this graph over here everything below the line would have been called significant when we read all those studies, once we've adjusted for positive and negative controls there's only one study that actually shows up as significant. [00:21:18] Being able to do this at scale requires us to try many different things, at scale, on all this data at the same time, across multiple organizations, everything I talked about. There's actually a study, I can link the paper later, where they showed what has been published versus what has not, and everything that's published just hugs that significance line, right? And I know computer architects are not guilty of this at all, fudging their simulator parameters until they get something that matches what you need to get an SC paper, right? So there are interesting things happening here. Just as a bit of preaching, some things we're doing with the DARPA Next Generation Social Science program: there are new ideas like registered reports, where some journals now have you submit the research you're going to do and the analysis you're going to run, and you say this is exactly how I'm going to do it. They review that for whether it's valuable, and regardless of what you get at the end, they'll publish it as long as you followed your protocol, so
then you can actually publish negative results. That's one aspect. [00:22:28] Another challenge is synthetic data generation. I have some of this data, but how do I generate more data that I can use? Take a population: maybe a health system has a data set I want accurately represented, but I don't want to actually use that data because it's PHI. How do I create a synthetic data set that is both representative and useful? Doing that at scale requires a lot of computational power, and it's actually a very open problem. So where can new hardware help? Scale, scale, scale. We look over here and we have tons of structured and unstructured data, from personal health records, home treatments, reports, images, everything in between. Then we can get into some creepier stuff: Twitter, seeing how patients are doing, that kind of thing. So the scale of the types of data is massive. The other issue is how we handle real-time sensor data. This is one of our scale problems: we can't pull everything off the sensors and store everything at all times. We need to do it in situ, in real time, and correlate against all this information at scale. And just so you understand where we sit at the current state of the art: well, I go before Oak Ridge so they can win the battle by a large margin. Ten thousand cores over a number of compute clusters, petabytes of distributed storage. People use MATLAB, Anaconda, you know, they're writing at a somewhat higher level. All on premise, with an all-flash Solr cluster for the unstructured text. So the challenge is: how do we expand this to the novel architectures of the future? How can we scale using the ideas in this room? How can we ease the movement of large data to those new architectures? How do we ensure privacy so that we stay HIPAA compliant? And how do we provide ease of use that lets people actually use it without, like I said, having a PhD
before they can even touch it. So that's my talk; I'd be happy to take questions. [00:24:29] >> Thank you. >> [Question inaudible.] >> Yes, there are situations where that helps, because we just need to do a bunch of things; it's embarrassingly parallel, we need to do a bunch of things in parallel, right? But right now I would say all the investment, especially at the DOE, has been in neural nets; that's where all the investment is. But I think there's also a need here for things like Solr and text search and those kinds of things, which I'm not sure the DOE provides. Like I said, we do all of that in-house for our stuff. [00:25:30] Yeah, that's a completely different architecture, and also something where you can't just throw money at the problem right now. >> [Question inaudible.] >> Yes, so there are questions like: what can I do off of this watch? I can't send the raw signal of the heart rate, so what does the watch do? It calculates heart rate on the watch and sends the heart rate. So it's about building up scalable, computable APIs that define those different pieces, and being able to do that on the watch to create what we need. Or we have a yoga mat that you stand on, sort of like a Wii Fit board if you ever had one of those, to count steps and balance and things like that, but we can't send the raw signal off the board; we need to do the processing there. Those are the kinds of things where we're doing that in-situ processing, and the more we can do there, the better, especially when we get out into a less-connected environment: you're out in the world, you don't have cell service, so how do I continue to provide you feedback? The new watch supposedly has an ECG and that kind of thing, right? So there's all sorts of stuff out there, but I've got a guy at GTRI who would tell you not to trust it, because he thinks all their algorithms are garbage and you should use his. [00:27:01] Yeah. Right.
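The watch answer above (send the heart rate, not the raw signal) can be illustrated with a tiny on-device summarization step. This sketch assumes peak detection has already produced beat timestamps; the numbers are invented:

```python
def bpm_from_peaks(peak_times_s):
    """Summarize a raw pulse signal into a single heart-rate number.

    peak_times_s: timestamps (seconds) of detected heartbeats in a window.
    Returns beats per minute, the only value the device would transmit,
    instead of the full raw waveform.
    """
    if len(peak_times_s) < 2:
        return None  # not enough beats in this window to estimate a rate
    intervals = [b - a for a, b in zip(peak_times_s, peak_times_s[1:])]
    mean_interval = sum(intervals) / len(intervals)
    return 60.0 / mean_interval

# Nine beats, one second apart -> a 60.0 bpm summary leaves the device.
peaks = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
rate = bpm_from_peaks(peaks)
```

The design point is bandwidth and privacy at once: the raw waveform never leaves the device, only a derived scalar does, which is exactly the in-situ processing pattern described above.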
So even when you have much larger data sets, it still narrows down to the patients of interest, right? When we do have a lot of data it does do better, but one of the big problems I didn't talk about is rare disease: even if I have 100 million patients, how many people actually have that rare disease? That's still the big challenge. I didn't even talk about class imbalance, which is like the number one problem; that's why you see models barely better than a coin toss. And MIMIC is a tough data set because it's all sick people, it's all ICU data, right? So it's not balanced at all as far as the population is concerned. [00:27:57] >> Thank you. >> Thanks.