[00:00:05] >> I just want to say welcome to everyone here. Our first plenary will start in just a couple of minutes. Dana Randall is the organizer of this conference; she just stepped out, but I want to say that she has done an outstanding job. I've seen so many different disciplines and organizations in the room over the last few days, so we're really grateful to Dana for organizing this terrific conference. I'm Kaye Husbands Fealing, and I chair the School of Public Policy here at Georgia Tech. It is my pleasure to introduce to you this morning Dr. Daniel Neill, who will give the plenary talk whose title you can see on the screen: "Machine Learning and Event Detection for Population Health." This is a topic that is very near to my heart, because it's about the public service of science and scientific activities. In 2010 Dr. Neill was the recipient of an NSF CAREER award on a similar topic, machine learning and event detection for the public good. Dr. Neill's research focuses on developing new methods for machine learning and event detection in massive and complex datasets, with applications ranging from medicine and public health to law enforcement and urban analytics. He works closely with organizations including public health and police departments, hospitals, and city leaders to create and deploy data-driven tools and systems to improve the quality of public health, safety, and security. For example, he has evaluated early detection of disease outbreaks and employed methods for detecting and preventing hotspots of violent crime. So as you can see, Dr. Neill is really at the intersection of machine learning tools and the service of the public. [00:02:04] He is an Associate Professor of Computer Science and Public Service at NYU, also in the Wagner School of Public Service, and he directs the Machine Learning for Good lab at NYU. He previously was a professor at Carnegie Mellon, where he also received his Ph.D. He was named one of the top ten artificial intelligence researchers to watch by IEEE Intelligent Systems. So as you can see, we are going to hear a spectacular presentation. Please welcome to the stage our first plenary speaker, Dr. Daniel Neill.

[00:02:49] >> Thank you. Thank you so much for the introduction and your invitation; I'm delighted to be here. I want to start with a little bit of context on who I am and what I do. I direct the Machine Learning for Good lab at NYU, and the main goal of this lab is to think about how we can target interventions in the public sector both more effectively and more equitably. In addition to working in a lot of different application domains, some of which I'll describe today, we also do a lot of fundamental methodological work in areas ranging from event and pattern detection to causal inference and algorithmic fairness. What I want to talk about today, though, is to really drill down on one of these particular application domains: what machine learning can do for public health.
[00:03:44] So I'll be talking about four different application domains, but in all of them the big picture is really the same: we achieve very early and very accurate detection of emerging events and patterns, and we use that to give public health better situational awareness, which they can then use for more timely and more precisely targeted interventions. I'll talk about a number of applications, including the early detection of disease outbreaks; how we can intervene to combat the opioid crisis; what we can do more broadly to give public health the awareness to identify events and trends that they might not already be looking for, giving them a safety net; and finally, time permitting, I'll talk a little bit about health disparities and environmental causes of those disparities, particularly housing.

So let's start with outbreaks. Why should we care? Well, a lot of the early-stage funding for this sort of work came from, for example, the Department of Defense, which is very concerned about a bioterrorist attack. This remains a very real and scary possibility: one estimate from DARPA is that a large anthrax release over a major city like Washington, D.C. or Atlanta could kill between one and three million people and hospitalize millions more. [00:05:09] Interestingly, though, the disease surveillance community has come to realize that an even greater threat is posed by naturally emerging infections. Of course, there's a different one every year that everybody needs to be thinking and worrying about, but I always go back to pandemic avian [00:05:27] influenza, because avian influenza had a frighteningly high mortality rate, and in fact the only thing that was missing for it to become a global human pandemic was the evolution of rapid human-to-human transmission. So that what-if scenario is the one that always sticks in my head.

Now, that being said, we want to prepare public health not just against the once-in-a-blue-moon, really catastrophic event; we also want to give them tools to better respond to the sorts of things they run into every day. How can we do better response to common disease outbreaks, as well as other emerging public health trends like opioids, as I mentioned? I'd argue that in all of these cases, a better understanding of emerging patterns, and earlier detection, has great potential to reduce the cost of outbreaks to society, both in lives and in dollars. One example I like to give: imagine on day zero this little stick man gets exposed to inhalational anthrax. He starts developing a few flu-like symptoms; these evolve into high fever, shock, and eventually a dead stick man. Now the good news, or good-ish news, is that the earlier in the progression of the disease the stick man is treated, the lower the mortality rate. In fact, one estimate from DARPA is that a two-day gain in detection time, assuming it actually leads to a public health response that is two days more timely and effective, could reduce fatalities from a bioterrorist anthrax attack by a factor of six. [00:07:00] That being said, while early detection is important, it's also hard, and one reason for that is the large lag time between the start of a person's symptoms and a definitive diagnosis that yes, this person has anthrax or influenza or bird flu or whatever. And of course one big challenge is that the early-stage symptoms of these sorts of things are typically very,
very general. That being said, there are a lot of changes in someone's behavior, even in the early stages of illness: maybe they buy over-the-counter medications, maybe they're absent from work or their children from school, maybe they go online and search for or tweet about their symptoms, or eventually maybe they visit a doctor, a hospital, or an emergency department. Of course, from one person doing this we don't get a whole lot of signal, but when a number of people are affected in the same general area, then we can actually detect these emerging outbreaks in their early stages as increases or changes in various quantities, for example the number of over-the-counter medication sales in the affected area. So that's one theme of this talk: we are able to achieve very early detection of emerging disease outbreaks by gathering syndromic, or symptom, data and identifying emerging spatial clusters of symptoms. This was actually one of the first areas we worked on, and a lot of the other work we've done since has built on this idea. Here's one example from a while ago, where we were able to detect a spike in the sales of pediatric electrolytes north of Columbus, Ohio, which turned out to be indicative of a fairly small and localized outbreak of gastrointestinal illness.

[00:08:44] Now, this is one particular example of what I see as a more general problem, which is multivariate event detection. Here what we have is spatial time series data from a large set of spatial locations s_i; these could be zip codes, for example. For each monitored location we monitor a large number of data streams d_m; these could be things like emergency department visits with different sorts of chief complaints, over-the-counter medication sales in different product categories, et cetera. And then for each location s_i and each data stream d_m, we have a time series of counts, for example daily counts c_{i,m,t}. Now, given all this data, we have three main goals: to detect any emerging events; to pinpoint where and for what time period [00:09:34] the event is occurring, that is, the affected subset of spatial locations; and finally to characterize the event somehow, to tell public health what's going on, which could be something simple like identifying which subset of the monitored data streams is actually affected. Another way to think about this is as a very large hypothesis testing problem: we're comparing a large set of hypotheses H_1(S, D, W), where D represents a subset of the monitored data streams, S represents the affected subset of locations, and W is a time window or duration. We want to compare all of these hypotheses to each other, and compare them all against the null hypothesis H_0 that no events of public health interest are occurring. To do this, we can rely on what we call an expectation-based scan statistic. The idea is that we're going to search for spatial regions (you can think of a spatial region as a subset of monitored locations, for example a subset of zip codes) where the recently observed counts for some subset of data streams are significantly higher than expected.
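To make that setup concrete, here is a minimal formalization. The notation follows the talk; the closed-form score in the second display assumes an expectation-based Poisson model with a multiplicative rate increase, which is one common choice rather than necessarily the exact model used in the work described.

```latex
% Observed counts c_{i,m,t} with expected baselines b_{i,m,t}; each
% alternative hypothesis H_1(S, D, W) is compared against the null H_0
% via the likelihood ratio:
\[
  F(S, D, W) \;=\; \frac{\Pr(\text{Data} \mid H_1(S, D, W))}{\Pr(\text{Data} \mid H_0)}
\]
% Under an expectation-based Poisson model, c_{i,m,t} ~ Poisson(q * b_{i,m,t})
% with relative risk q > 1 inside the affected subset, maximizing over q
% yields, with C and B the sums of counts and baselines over the subset:
\[
  F \;=\; C \log\frac{C}{B} + B - C \quad \text{if } C > B,
  \qquad F \;=\; 0 \text{ otherwise.}
\]
```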
[00:10:38] Now of course this requires two steps. The first is figuring out what we expect to see, and you can imagine doing this with pretty much any form of time series analysis: we compute the expected counts, or baselines, for each location, for each data stream, for each recent day. This baseline method could be something as simple as a moving average; more recently we've been doing a lot of work with Gaussian processes and how to scale them, and that seems to work pretty well. [00:11:06] Given all this, we can then compare the actual and expected counts for each of the many possible subsets of data streams, locations, and time under consideration. The way we do this is, for each subset, we find the subsets with the highest values of a likelihood ratio statistic. What is that? It's the probability of seeing the data that we actually observed given the alternative hypothesis H_1(S, D, W), divided by the probability of that data if nothing of interest were occurring.

Now of course the question remains: given the subsets that we found, and given that we did all that searching, just by chance we would expect to see some things with high scores. So, given the fact that we searched, how can we tell whether something is really significant? The answer, it turns out, is a randomization test: we ask how likely we would be to see any scores this high if the null hypothesis were actually true. We can do this by simulation. We create a large number of datasets under the null, we run the exact same search process in each of those datasets, we obtain the distribution of maximum scores, and then we compare what we actually observed to that distribution. That gives us a way of bounding our false positive rate, no matter how much searching we're doing.

[00:12:23] OK, so how do we actually make this efficient? Obviously there are a lot of subsets in the data. It turns out that a lot of what we've done over the last five to ten years has been based on a simple yet very useful methodological idea, which is that we can treat pattern detection as a search over subsets, or as I like to call it, a subset scan, and in some cases we can do this scan very, very efficiently.
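As a rough illustration of the pieces just described, here is a minimal sketch in Python. It is not the production system: it assumes Poisson-distributed counts, a precomputed baseline array, and an explicit list of candidate regions, and all names are illustrative.

```python
import numpy as np

def ebp_score(counts, baselines):
    """Expectation-based Poisson log-likelihood ratio score for one subset.

    counts, baselines: 1-D arrays for the records in the subset. The
    maximum-likelihood relative risk is q = C/B; plugging it back into the
    Poisson likelihood ratio gives C*log(C/B) + B - C when C > B.
    """
    C, B = counts.sum(), baselines.sum()
    return C * np.log(C / B) + B - C if C > B else 0.0

def max_score(counts, baselines, regions):
    """Scan a list of candidate regions (index arrays), return the best score."""
    return max(ebp_score(counts[r], baselines[r]) for r in regions)

def randomization_test(counts, baselines, regions, n_sims=999, seed=0):
    """Monte Carlo p-value for the observed maximum score: simulate datasets
    under H0 (counts ~ Poisson(baseline)), rerun the identical search on
    each, and count how often the simulated maximum beats the observed one.
    """
    rng = np.random.default_rng(seed)
    observed = max_score(counts, baselines, regions)
    beat = sum(
        max_score(rng.poisson(baselines), baselines, regions) >= observed
        for _ in range(n_sims)
    )
    return observed, (beat + 1) / (n_sims + 1)
```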
[00:12:51] So you can imagine maximizing some measure of interestingness, or anomalousness, over subsets, and that has a lot of benefits in terms of statistical power. For one thing, it allows us to find very subtle patterns that typical anomaly detection methods might miss. Think about it: what you've got is a combination of the needle-in-the-haystack problem, where what you're looking for is very small compared to the entire search region, and at the same time connect-the-dots, because looking at data from any particular spatial location, any single hospital, et cetera, may not be sufficient to realize that you have a pattern of interest. So this is a hard detection problem, and sometimes we need to look at the entire subset to realize that something of interest is going on. Of course, actually maximizing the likelihood ratio statistic over all subsets would be totally computationally infeasible. However, we've developed a number of algorithms, which we call fast subset scanning, that can efficiently identify the most interesting subsets of data records without an exhaustive search. So what we have here is a really powerful building block, into which we can then build all sorts of constraints, and actually solve the sorts of constrained search problems that occur in real-life settings.

[00:14:08] For public health surveillance this helps in a lot of ways. It lets us detect irregularly shaped clusters of disease, which leads to better spatial precision for identifying outbreaks, as well as higher power to detect subtle emerging outbreaks in their early stages. It gives us the ability to combine multiple data streams: you can think about searching over subsets of data streams as well as subsets of space. And finally, it lets us do fun things with massive and complex data sources such as Twitter, search queries, et cetera. As a brief teaser: if we look at the heterogeneous network that is Twitter, we can detect subsets of graph nodes that might include a number of different things: locations, users, keywords, hashtags, links, and videos. This not only allows us to detect disease outbreaks earlier (a particular example would be looking for rare disease outbreaks such as hantavirus in Chile), [00:15:12] but also allows us to characterize what is going on, for example by producing a word cloud from the tweets, as well as where and who is tweeting about it, and so on. So again, this is a very powerful approach, not only for early detection but for giving public health a good idea of what is going on in their data.
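A minimal sketch of the fast subset scanning idea mentioned above, for the expectation-based Poisson score. That score satisfies the linear-time subset scanning (LTSS) property, so the exponential search over subsets collapses to a sort plus one linear pass; the implementation details here are my own illustration.

```python
import numpy as np

def fast_subset_scan(counts, baselines):
    """Linear-time subset scan sketch for the expectation-based Poisson score.

    For score functions with the LTSS property, the highest-scoring of the
    2^N - 1 subsets of records is guaranteed to be a prefix of the records
    sorted by priority c_i / b_i, so a sort plus one pass replaces the
    exhaustive search. Assumes strictly positive baselines.
    """
    order = np.argsort(-(counts / baselines))   # highest priority first
    C = np.cumsum(counts[order])                # prefix sums of counts
    B = np.cumsum(baselines[order])             # prefix sums of baselines
    with np.errstate(divide="ignore", invalid="ignore"):
        scores = np.where(C > B, C * np.log(C / B) + B - C, 0.0)
    k = int(np.argmax(scores))                  # best prefix
    return set(order[: k + 1].tolist()), float(scores[k])
```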
[00:15:30] So I mentioned the multivariate event detection problem, and I'm going to add one more wrinkle: in addition to detecting, pinpointing, and characterizing these emerging outbreaks of disease, we can also think about who is affected, and in particular identify any differentially affected subpopulations of the monitored population. Keep in mind that a subpopulation could be defined along multiple characteristics or their intersection: we could think about outbreaks that might have differential effects based on gender, based on age, based on ethnic or socioeconomic group, or based on the presence of behavioral risk factors.

[00:16:08] We can identify any combination of these. More generally, we have a set of observed, typically discrete-valued attributes, and now we're identifying not only where and when the outbreak occurs, but also the subset of affected values for each of the attributes. And by the way, we'd now be doing this with individual disease cases, not just with counts of how many disease cases are happening. This approach applies in the general disease surveillance arena, but I wanted to talk about it in a different context: what we can do about the opioid crisis. As you all probably know, this nation, and for that matter the world, has an increasingly serious problem with opioids [00:16:56] and drug use disorders more generally. In 2017 we had more than 72,000 drug overdose deaths, more than in any year in recorded history. So this is a huge and increasing problem, and public health at the moment is extremely motivated to understand what's going on, to identify and predict emerging trends in opioid use disorder and in overdoses, and to better target the appropriate sorts of interventions. Of course, there are a lot of different interventions public health can deploy: they can try to prevent people from getting addicted in the first place; they can treat opioid addiction using medication-assisted therapies; they can rescue people who are currently having overdoses through rapid administration of naloxone, which can reverse the overdose; and finally they can direct, hopefully, some of those people into recovery, through peer recovery coaching for example. And of course, in addition to how they should intervene, there are a lot of questions in terms of where and for whom.

[00:18:01] So I'd argue that machine learning has a really important role to play here. It has the potential to save lives by detecting subtle and emerging patterns of overdoses, as well as opioid use disorders, in their early stages, enabling public health to target an early and effective response. We've done a number of things related to opioids. We started with the idea of geographic opioid surveillance, and this is fairly simple; it just asks, where should we be focusing our intervention efforts? You can think of the main goal as predicting overdose trends in both space and time, as well as identifying what's happening currently, such as anomalous spikes in overdose deaths. There are a number of useful predictors: recent spatial or temporal trends in overdoses, a variety of leading indicators such as behavioral risk factors, as well as broader neighborhood characteristics. We've done a lot of work on forecasting, and in particular on how we can scale up these very powerful, very flexible non-parametric methods like Gaussian processes. That's really important when you have correlated spatio-temporal data: both so you can predict accurately in that setting, and so you can distinguish anomalous patterns from correlated fluctuations that might otherwise look anomalous if you're not taking these correlations into account.
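A minimal sketch of using a Gaussian process to produce expected counts (and uncertainties) for correlated space-time data, here with scikit-learn's off-the-shelf GP regressor. The kernel choice and feature encoding are placeholder assumptions, not the scalable GP machinery the talk refers to.

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, WhiteKernel

def gp_baselines(X_hist, y_hist, X_recent):
    """Fit a GP to historical overdose counts and predict expected counts
    for recent (county, month) points, with predictive uncertainty.

    X_*: numeric features, e.g. columns (county_index, month_index);
    y_hist: observed monthly counts. The RBF + white-noise kernel is a
    placeholder; a real system would encode spatial and seasonal structure.
    """
    kernel = 1.0 * RBF(length_scale=[1.0, 3.0]) + WhiteKernel(noise_level=1.0)
    gp = GaussianProcessRegressor(kernel=kernel, normalize_y=True)
    gp.fit(X_hist, y_hist)
    mean, std = gp.predict(X_recent, return_std=True)
    return mean, std

# Standardized residuals z = (y_recent - mean) / std can then be subset-
# scanned instead of raw counts, so the learned correlations are respected.
```

Subset-scanning those standardized residuals, rather than the raw counts, is roughly the combination the talk turns to next under the name Gaussian process subset scanning.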
[00:19:28] So here's an example. We used aggregate monthly counts of fatal opioid overdoses for six counties in the New York City area, about 16 years of data. We developed a new approach, which we call, very creatively, Gaussian process subset scanning, because it combines Gaussian processes, to model the correlations, and subset scanning, to identify the most anomalous space-time regions. We compared our new method to typical anomaly detection approaches on both real and synthetic data. I won't go through all the results now, but one thing we saw is that Gaussian process subset scanning does better than either Gaussian processes or subset scanning alone. This really means two things. One is that when you have a subtle anomaly, it's not enough to just look at individual points; the nearby points matter. The other is that when you have correlated data, as you do in many urban problems, the covariance structure really matters and you've got to take it into account. What we found were statistically significant spikes in overdose cases, and they corresponded to very interesting things. The New York City area had an overdose spike in mid-2006, which turned out to be just before the introduction of a city-wide program for naloxone distribution. Now, it's not clear to me whether this was causal: whether that spike could have been a lot worse if the city hadn't started distributing naloxone, or whether its effects actually caused the city to realize, hey, we need to do something. In any case, there was another spike at the end of 2015. Any of you who are familiar with what's happening in the current opioid crisis would not be that surprised that a lot of these spikes are due to fentanyl, which is pretty scary stuff, but back in 2015 that was a little bit less obvious.

[00:21:17] The other thing to note is that if you just apply an off-the-shelf anomaly detection method, a one-class SVM for example, you fail to detect the relevant trends; you just pick up wherever any of the individual county overdose levels happen to be high. OK, moving on to subpopulation monitoring. You can think of this question as: for whom, for which subpopulation, should I intervene? Here the goal is to provide an early warning system for newly emerging spikes or clusters of overdose deaths at the subpopulation level. This is based on a new detection method which combines the idea of subset scanning with a few other tricks, and allows us to detect emerging geographic, demographic, and behavioral patterns. Think of it as a combination of these characteristics, both achieving earlier detection and characterizing where and who is affected. For example, we might be able to say something like: we observe an emerging cluster of overdoses in these four zip codes, among white males aged 20 to 49, who mix their opioids, in this case heroin, with alcohol. So we're very precisely identifying the behavioral risk factors, the demographics, the locations, and so on.
[00:22:33] The way we do this is by identifying subspaces of the attribute space: think of a subset of values for each of the monitored attributes, with higher-than-expected numbers of recent case counts. Here we're actually talking about fatal overdoses, so that's the sort of data we're working with. That means we have toxicology reports, so we actually know what sorts of drugs are in each person's bloodstream when they die. [00:23:02] So again, we have the affected genders, the affected age ranges, which drugs are involved, as well as space and time.

To do this, we again have two steps. It turns out that in this setting, even figuring out what we expect to see is kind of hard, so we developed a novel tensor decomposition approach, which helps us do this across multiple dimensions while maintaining computational efficiency. [00:23:22] The approach to actually finding the anomalies can be thought of as iterative conditional optimization: conditioned on the subsets of values for all of the other monitored attributes, we can efficiently optimize over subsets of values for a given attribute using these fast subset scan techniques, and then we iterate across all the different attributes until convergence. If you're interested in what's going on under the hood there, please feel free to check out some of our fast subset scanning work.
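A minimal sketch of the iterative conditional optimization just described, over the cells of an attribute tensor with observed and expected counts per cell. Each inner step is an exact linear-time subset scan over the values of one attribute, conditioned on the current value-subsets of the others; the data layout and helper names are my own assumptions.

```python
import numpy as np

def ltss(c, b):
    """Best-scoring subset of attribute values under the expectation-based
    Poisson score, via linear-time subset scan over priority-sorted values."""
    b = np.maximum(b, 1e-12)                    # guard empty cells
    order = np.argsort(-(c / b))
    C, B = np.cumsum(c[order]), np.cumsum(b[order])
    with np.errstate(divide="ignore", invalid="ignore"):
        s = np.where(C > B, C * np.log(C / B) + B - C, 0.0)
    return set(order[: int(np.argmax(s)) + 1].tolist())

def multidim_scan(obs, exp, vals, n_values, max_iters=20):
    """obs, exp: observed/expected counts, one entry per tensor cell;
    vals: dict attribute -> integer-coded value of each cell;
    n_values: dict attribute -> number of distinct values."""
    subsets = {a: set(range(k)) for a, k in n_values.items()}
    for _ in range(max_iters):
        changed = False
        for a, k in n_values.items():
            # restrict to cells matching current subsets of the OTHER attributes
            mask = np.ones(len(obs), bool)
            for o, vs in subsets.items():
                if o != a:
                    mask &= np.isin(vals[o], list(vs))
            # aggregate counts by value of attribute a, then scan exactly
            c = np.array([obs[mask & (vals[a] == v)].sum() for v in range(k)])
            b = np.array([exp[mask & (vals[a] == v)].sum() for v in range(k)])
            new = ltss(c, b)
            changed |= new != subsets[a]
            subsets[a] = new
        if not changed:
            break  # local optimum; in practice, use multiple random restarts
    return subsets
```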
OK, so we applied this to overdose data in Allegheny County, Pennsylvania. Here we had the county medical examiner's data for fatal drug overdoses: about eight years of data, about 2,000 cases over that time period, and a lot of data for each person. We simplified that to 30 dimensions, which included the presence or absence of 27 common drugs in each person's system, as well as age, race, and gender. We shared the discovered clusters with the Allegheny County Department of Human Services, and we were very interested to see that, despite the fact that they had been analyzing the same data manually for several years, we were able to tell them a number of things they didn't already know.

[00:24:37] As I said, it might not be that surprising now, but at least back when we did this analysis it wasn't as clear: fentanyl is actually driving a lot of the overdose clusters. What is it? It's a dangerous drug that has been a huge problem, not just in western Pennsylvania but everywhere. [00:24:54] It's often mixed with white powder heroin, or sold disguised as heroin, which is a problem because it's about 50 times more potent than heroin, so it can very quickly lead to overdose and death. In early 2014 we began to see a number of fentanyl clusters, ranging from 14 deaths in January 2014 to a cluster of 26 deaths countywide in March and April 2015. One of the interesting things about this latter cluster is what would have happened if we had been running prospectively rather than retrospectively, looking at each day of data as it came in (and this is all assuming we could actually get the data without lags, which is a whole other issue).

[00:25:36] We would have identified a cluster of four overdose deaths that occurred on March 29th, just two days into that outbreak, with strong geographic and demographic similarities. So we would have realized: hey, there's something going on; it's clustered in the southeastern suburbs of Pittsburgh; it's clustered among white males ages 20 to 49. This would have given public health an area and a subpopulation to target with interventions. Of course, this was retrospective, and what actually happened was a huge number of fentanyl, heroin, and combined deaths through the end of June.

[00:26:13] We discovered another set of clusters that were very interesting because they involved the combination of methadone, which is actually a drug used to treat heroin addiction, and the prescription drug Xanax. It turns out that this combination will get you high, but it will also kill you. From 2008 to 2012, we observed a number of these clusters of overdoses based on a combination of methadone and Xanax. Each cluster was fairly small, about 3 to 7 cases, but was localized in space, time, and subpopulation. [00:26:43] Interestingly, though, from 2013 to 2015 these clusters essentially disappeared: we saw big drops in methadone and methadone-plus-Xanax deaths overall, and none of these overdose clusters. That leads to some interesting questions: why did these deaths cluster, when methadone deaths in general did not? And what factors could explain the dramatic reduction in methadone-and-Xanax overdose clusters? Focusing on the second question, we believe this was the effect of an important policy change in the state of Pennsylvania: the passage of the Methadone Death and Incident Review Act in October 2012 dramatically increased state oversight of methadone clinics and prescribing physicians, and we believe this was one of the main driving factors behind the reduction in these overdose clusters. There were other confounding factors as well, other good things, including the approval of generic Suboxone in early 2013; one thing this did was lower the cost of that sort of treatment as an alternative to methadone clinics. So there may have been multiple driving factors, but we believe the policy change was really important in its effect on reducing overdoses.

[00:27:57] OK, so we've also looked at individual-level opioid use monitoring, using data from Kansas's prescription drug monitoring program. What we have here is seven years of de-identified data from Kansas, covering about one million individuals who at some point were prescribed opioids, with unique patient, prescriber, and dispensary identifiers. These identifiers are just symbols; everything has been de-identified, with no personally identifiable information.
[00:28:24] So what we have for each patient is the duration and quantity of prescribed opioids at various points in time, which lets us create a timeline of the morphine milligram equivalent (MME) that has been prescribed for each patient. Now, the question, given all this data, is whether we can use these timelines to identify early indicators that are predictive of later opioid misuse, which includes things like doctor shopping, or of unsafe prescribing patterns. And of course we'd eventually like to tie these to overdoses, to be able to identify and reduce the risk of future overdoses in the early stages of these emerging behavioral patterns. There are a number of different ways to do this; our current approach uses an off-the-shelf but very useful clustering algorithm called k-Shape, and one thing it does is group patients with similar patterns in their MME timelines. The question then, of course, is whether some of these patient clusters are associated with higher risk of either overdoses or red-flag behaviors that might indicate misuse or unsafe practices. The other question is how early in a patient's progression we can do this: what if we don't have the whole timeline, only the first part, up to the present day? Can we confidently assess the risk of future red flags given these partial MME timelines?

[00:29:52] The short answer to both of these questions is yes. These clusters behave very differently in terms of common risk factors, including use of multiple opioid types; number of unique prescriber zip codes (so if someone is traveling to multiple different prescribers in different zip codes, we might have something going on); combined use of opioids and benzodiazepines, as we talked about; and so on. We can also do early individual-level risk assessment by classifying these partial trajectories. Up to some point in a person's trajectory we might not believe they carry a whole lot of risk, but after the point when their opioid use spikes up and we see these sorts of increasing trends, we can realize: hey, this person gets assigned to a high-risk cluster, we see the MME bumping up, and so now we can identify this person as high-risk and possibly in need of some public health intervention, hopefully preventing future overdoses.

[00:30:50] So what we have in this setting is a number of new methods that can be used for early warning and advance forecasting of overdoses, at the geographic, subpopulation, and individual levels. We've done a lot of retrospective analyses, but they suggest high potential utility for prospective drug overdose surveillance, actually giving public health the opportunity to intervene and prevent overdoses. We're currently working with a number of people in the public health domain, and we're really trying to work with public health departments to deploy targeted interventions to prevent overdoses, and, perhaps most interestingly, to evaluate the effectiveness of these interventions through randomized trials.
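The "k-Shape" algorithm referred to here is presumably Paparrizos and Gravano's shape-based time series clustering; a minimal sketch using the tslearn implementation on fabricated MME timelines follows. The data, cluster count, and downstream logic are all assumptions for illustration.

```python
import numpy as np
from tslearn.clustering import KShape
from tslearn.preprocessing import TimeSeriesScalerMeanVariance

# Fabricated stand-in: one MME timeline per patient (e.g., weekly totals
# over two years), resampled to a common length.
rng = np.random.default_rng(0)
mme = rng.gamma(shape=2.0, scale=30.0, size=(500, 104))

# k-Shape compares z-normalized series with a shift-invariant shape measure.
X = TimeSeriesScalerMeanVariance().fit_transform(mme[..., np.newaxis])
labels = KShape(n_clusters=6, random_state=0).fit_predict(X)

# Downstream steps (not shown): compare rates of red-flag behaviors across
# clusters, and assess a partial timeline by assigning its observed prefix
# to the nearest cluster centroid.
```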
[00:31:42] So, moving on to the third topic, I want to talk a little bit about a safety net for public health. As I mentioned previously, one thing that public health has gotten pretty good at is syndromic surveillance: they have a bunch of specific things they're looking for, influenza-like illness, hemorrhagic fevers, Zika, et cetera, so they can monitor these sorts of cases in space and time, and we've given them tools to detect emerging patterns in this sort of data. But all of this begs the question: what else is there? What is there in the data that we don't know about, but should know about? This brings us to the idea of what we call pre-syndromic surveillance: how do you do surveillance for the things that public health has not already grouped into syndromes, and is not already monitoring? Imagine we have emergency department data with free-text chief complaints, plus some structured information about each patient: demographics, space and time, which hospital they're at, and so on. The challenge, given all this data, is that we can't just create a zillion syndromes trying to anticipate every possible future event of potential public health importance. What we need instead is a method for identifying relevant clusters of disease cases by working directly from the free text, not just the clusters that correspond to existing syndromes. This is actually something a number of public health departments identified the need for, even before we had any tools to do it; there was a consultancy through the International Society for Disease Surveillance, so this was clearly a very pressing need for public health.

So what's the problem, why not just throw existing methods at it? Typical approaches like the ones I previously described are very effective at detecting emerging outbreaks if you have a commonly seen and general pattern of symptoms. [00:33:35] The question, though, is what happens if something new and scary comes along: something like, I'm not just coughing, I'm coughing up blood; or something previously unseen and novel in terms of symptoms: my God, this patient's nose just fell off. These are things where you would think it would not take very many cases to realize that we had something new, different, and potentially scary on our hands, in need of public health intervention. The problem with existing systems is that these specific chief complaints typically just get mapped to some broader syndrome category: coughing up blood, that's respiratory; nose falls off, well, it's the nose, maybe that's respiratory too. The problem, then, is that this dilutes the outbreak signal, which could delay or even prevent detection of these newly emerging outbreaks.

[00:34:26] So we developed a solution, which we think of as pre-syndromic surveillance, that combines text-based and event detection approaches: topic modeling on the one hand and multidimensional subset scanning on the other. You can think of it as detecting emerging patterns of keywords in this free-text chief complaint data. The way we do it is using a variant of topic modeling, a variant of the latent Dirichlet
[00:34:52] allocation (LDA) approach, where you can imagine that each of these chief complaints is generated as a mixture of topics, and each topic in turn is a probability distribution over keywords. The advantage of this sort of modeling approach is that your set of topics is learned on the fly from your data, and when we do this on public health data, these clusters of keywords end up becoming essentially the set of syndromes one might expect in public health. So we get one topic that is vomiting and diarrhea, one that is dizzy, lightheaded, and weak, et cetera. [00:35:31] The advantage now is that we can classify cases to these topics and look for trends, not just in the existing syndromes but treating these learned topics as new syndromes. So we can look at the time series of, in this case, hourly counts for each hospital, for each age group, for each identified topic, and again apply a multidimensional scan approach.

[00:35:53] Another way to think about it is that we're doing a large search over combinations of subpopulations, topics, and hospitals. For each combination of hospital, age range, and topic, we compare the count, which is the observed number of cases in that time interval that match on the hospital, age range, and topic, to the expected count, which you can compute in a variety of ways; we used a moving average adjusted for day of week and hour of day. [00:36:24] Given all this information, you can compute the log-likelihood ratio score, and if you make certain assumptions, Poisson-distributed counts and a multiplicative increase under the alternative hypothesis, the score function simplifies to a simple and easily computed function of the observed and expected counts for a particular hospital, age range, and topic. We can then return the cases corresponding to each of the top-scoring subsets as the newly emerging clusters that public health should focus on.

[00:36:59] The one wrinkle here is that we're not just doing plain-vanilla latent Dirichlet allocation; we're actually learning two sets of topics. We learn a set of static topics from the entire dataset that describes the things that typically occur in public health settings. However, we also learn a set of newly emerging topics, designed to capture rare or novel diseases, possibly never-before-seen combinations of symptoms; essentially, things that are not captured by the set of static topics. That set is learned just over the most recent data, and in such a way that we identify newly emerging topics that don't correspond to anything we've discovered so far.
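A minimal sketch of the two components just described, using scikit-learn's LDA as a stand-in for the talk's topic model; the static/emerging topic split and the heavy pre-processing pipeline are omitted, and everything here is illustrative rather than the deployed system.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

def fit_topics(chief_complaints, n_topics=50):
    """Learn topics over free-text chief complaints and assign each case to
    its most probable topic. A deployed system would additionally learn a
    small set of 'emerging' topics on only the most recent data, constrained
    to differ from these static topics."""
    vec = CountVectorizer(min_df=15)  # drop very rare (often misspelled) tokens
    X = vec.fit_transform(chief_complaints)
    lda = LatentDirichletAllocation(n_components=n_topics, random_state=0).fit(X)
    return vec, lda, lda.transform(X).argmax(axis=1)

def cell_score(observed, expected):
    """Expectation-based Poisson score for one (hospital, age range, topic)
    cell over a recent time window; cells are then ranked by this score."""
    C, B = float(observed), float(expected)
    return C * np.log(C / B) + B - C if C > B else 0.0
```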
[00:37:45] We applied this approach to data from New York City's Department of Health and Mental Hygiene. We're working very closely with them, and we actually just handed over a tool that they can use for prospective disease surveillance, but the analysis here is retrospective. We had six years of data, about 28 million emergency department visits, each with a free-text chief complaint as well as structured information, from about 53 hospitals in New York City.

[00:38:12] This data is really noisy and difficult; substantial pre-processing was necessary because of the size and messiness of the data. The figure on the right, by the way, shows all the different ways vomiting was spelled in our data, or at least all the spellings that appeared at least 15 times; this is a real and noisy dataset. In any case, after that pre-processing and after cluster detection, we were able to find a lot of really useful things in the data that public health was not aware of. We did a blind evaluation: we showed them the clusters we detected, alongside those from a state-of-the-art keyword-based scan approach, and for each cluster they indicated whether it was relevant to public health, i.e., something they would actually want to go out and investigate; meaningful, which could be something like a car accident, where it's a real public health cluster but maybe not something they particularly want to investigate; or not of interest, basically just garbage. We saw that the multidimensional semantic scan produced a lot more relevant and meaningful clusters than the keyword-based scan approach.

[00:39:21] So again, this gave public health a lot of things in their data that they did not already know about. What were some of these things? Well, I think the most interesting analysis we did was to ask: what happened in the week of data following Hurricane Sandy, a major storm impacting New York City? What sorts of clusters emerged in the emergency department data following Sandy? [00:39:46] It turned out we saw a really interesting progression of cases. In the first few days right after the storm, there were a lot of acute cases: we got clusters of falls, shortness of breath, leg injuries, et cetera. A few days later, people came in with clusters of mental health disturbances, things like depression and anxiety. And a few days after that, about a week after the storm, we really saw the entire burden the storm placed on the medical infrastructure: people coming in for things like methadone maintenance and
[00:40:19] dialysis treatments, and so on; basically things that should have been handled in local clinics, but because all the clinics were closed, people had to come into the emergency department. So this gives public health and hospitals a good idea of what they need to do in preparation for a disaster: what are the sorts of things they're going to have to deal with? In the rest of the timeframe, we found many other events of public health interest: all sorts of accidents (motor vehicle, ferry, school bus, and even elevator accidents), a lot of contagious diseases, and a lot of other things, ranging from pepper spray attacks to carbon monoxide poisoning, and of course plenty of drug overdoses. Here's an example of one cluster we found: nine patients coming into the emergency department whose chief complaint, essentially, was that they all believed they had drunk coffee tainted with glass. Now, this is not something public health would typically be looking for, but it is something they need to know about.

OK, so I showed you that evaluation pretty quickly, and one thing to note is that even though the semantic scan is able to identify a lot of clusters that are real public health events, public health doesn't care about a lot of them. Things like car accidents: they're not going to investigate every time a cluster of people comes into the emergency department because of a car accident. [00:41:45] So what we need to do is focus the system's attention on the most relevant clusters, and the way to do that, of course, is by including our public health practitioners in the loop and incorporating their feedback, actually learning models within the system. Essentially, the set of topics is now allowed to grow over time, and we get relevance feedback from the public health users as to whether each is relevant or irrelevant. What happens now is that at any point in time, we're identifying not only the novel, newly emerging outbreaks, but also emerging clusters of things that we've already identified that they care about and want to monitor, and we're distinguishing those from the irrelevant clusters that they don't want us to report. This approach really helps to zoom in on the set of things public health cares about, while avoiding false positives, or at least distractions they're not interested in investigating. So what we have here is a safety net: a way to supplement existing systems by letting public health know when something unusual or newly emerging comes along. We developed one of the first methods for doing this, allowing us to accurately and automatically discover case clusters corresponding to novel disease outbreaks and other patterns of interest.
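A minimal sketch of that human-in-the-loop filtering, with hypothetical data structures; the actual system learns models from the feedback rather than applying a simple lookup.

```python
# Practitioner feedback: topic_id -> "relevant" or "irrelevant". Topics with
# no feedback yet are treated as novel and always surfaced.
feedback = {}

def clusters_to_report(detected_clusters):
    """Report novel topics and relevant known ones; suppress the rest."""
    return [c for c in detected_clusters
            if feedback.get(c["topic_id"], "novel") != "irrelevant"]
```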
So I'll move on to the last topic; I guess I still have a few minutes to do that. This is a little bit different from everything I showed you before, because now we're focused on a causal question: how we can identify the causal effects of environmental exposures. We've gotten particularly interested in the connection between housing and health.

[00:43:35] What we have here is a really rich dataset of Medicaid data; Medicaid is primarily data for low-income individuals. That data is linked to detailed building characteristics, matching by address, and what we're trying to accomplish is to identify the impacts of poor-quality housing on chronic health. There's an intuition that bad housing leads to bad health, but what we want to do is quantify that: we want to be able to say which bad housing conditions impact which chronic health conditions, for which subpopulations, and to what extent. That's really the set of questions we're trying to answer.

[00:44:25] Now, the key idea here is that these sorts of effects, whether of a treatment or an exposure, typically have a lot of heterogeneity: some subpopulations might be affected a lot, others might not be affected at all. So what we can do is use a variant of this multidimensional subset scan to identify the most affected subpopulations, as well as what is affecting them, in terms of bad housing conditions, and how they are affected, in terms of which health conditions are actually increased. Of course, there are a lot of challenges here. One is that there may be a lot of known confounders; an obvious one is selection into treatment, or in this case selection into exposure. Why do some people end up in particular housing? It's not random assortment, and certainly at a minimum we have to control for things like, say, [00:45:20] people living in supportive housing: they receive supportive housing because they have certain mental health issues, substance abuse challenges, and so on, and the housing is not causing those health conditions. We want to be as careful as we can not to confuse cause and effect here. So at a minimum, we can adjust for these known confounders using techniques like propensity score matching. This doesn't resolve all the potential causal challenges, but it helps. The other thing to note is that we need to account for multiple hypothesis testing: we're searching over many, many subpopulations, and just by chance we're going to pick out subsets of individuals that seem to have increased rates of illness. We want to avoid such false positives, or more precisely, we want to bound the rate of false positives, by correctly adjusting for multiple hypothesis testing using a randomization testing approach like the one I previously described.
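A minimal sketch of propensity score matching for this kind of setting, with scikit-learn. The confounder list, 1:1 matching, and the caliper-free nearest-neighbor rule are all simplifying assumptions, not the study's exact design.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import NearestNeighbors

def propensity_match(X, exposed):
    """1:1 nearest-neighbor matching on the estimated propensity score.

    X: confounder matrix (e.g., age, prior diagnoses, supportive-housing
    status); exposed: boolean array, True if the individual has the exposure
    (e.g., lives in poor-quality housing). Returns index arrays of exposed
    individuals and their matched unexposed controls.
    """
    ps = LogisticRegression(max_iter=1000).fit(X, exposed).predict_proba(X)[:, 1]
    treated = np.where(exposed)[0]
    controls = np.where(~exposed)[0]
    nn = NearestNeighbors(n_neighbors=1).fit(ps[controls].reshape(-1, 1))
    _, idx = nn.kneighbors(ps[treated].reshape(-1, 1))
    return treated, controls[idx.ravel()]
```

Outcome rates can then be compared between the matched groups, with the multidimensional scan run over the matched data to surface differentially affected subpopulations, and a randomization test applied to control the false positive rate of that search.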
[00:46:20] To do this, we have a few different steps. The first is just to get at the question of which health conditions are actually the important ones. There are many, many different diagnoses in the Medicaid data, and a priori we have some ideas of which ones might be important for housing-related health, but we're not sure. In this case, what we do is something pretty simple, which is to build a predictive model at the building level, answering the question of which combinations of diagnoses (think of the counts of individuals, both adults and children, who had each of 65 different diagnoses) are predictive of a kind of interesting gold standard: which buildings actually showed up on New York City's landlord watchlist. The assumption is that these are some of the worst buildings overall in terms of building conditions. It turns out that several of the things we were looking at were predictive of a building's occurrence on the landlord watchlist, including adult asthma and COPD, [00:47:27] childhood mental health disorders, specifically ADHD and adjustment disorder, and injuries to both children and adults. So this gives us a set of outcomes to look for, and now we can try to pick up heterogeneity in these treatment effects.

[00:47:44] We developed an approach that's a variant of the multidimensional scan, looking specifically for heterogeneous treatment effects, comparing exposed and non-exposed individuals. When we do that, we get results like this one: we see that crowded housing is associated with increases in respiratory conditions and injuries among the subpopulation of Asians living in Manhattan. [00:48:11] So that's kind of interesting; what's going on here? Well, we think the respiratory-conditions result is actually not so much a building-level effect as a neighborhood-level effect: a lot of the crowded housing for Asians in Manhattan is centered in Chinatown, while a lot of the non-crowded housing, among Asians on Medicaid in Manhattan, is scattered across other places in the city. It turns out that Chinatown has really bad air quality, because of all the buses essentially stopping and idling their engines, and for a lot of other reasons, so we believe a neighborhood effect is driving that condition. The injuries, we don't know about; we're still exploring, and it's something public health is interested in, but we don't know the answer yet.
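A minimal sketch of the building-level predictive model described above, on fabricated data. In the real analysis the features are counts of residents with each diagnosis and the label is presence on the NYC landlord watchlist; here both are simulated, so the model structure, not the numbers, is the point.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Hypothetical building-level design matrix: counts of residents (adults and
# children separately) with each of ~65 diagnosis groups, one row per building.
rng = np.random.default_rng(0)
X = rng.poisson(1.0, size=(2000, 130))
on_watchlist = rng.random(2000) < 0.05   # stand-in for the watchlist label

model = LogisticRegression(max_iter=2000, class_weight="balanced")
print(cross_val_score(model, X, on_watchlist, cv=5, scoring="roc_auc").mean())
# Large positive coefficients would flag the diagnoses (e.g., adult
# asthma/COPD, childhood ADHD, injuries) most predictive of the watchlist.
```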
[00:48:56] One last thing: you might be wondering, well, what about unobserved confounders? It turns out that this is just one way to approach these sorts of causal inference questions. Another is to look for natural experiments, and we've developed methods that can do this sort of discovery of natural experiments automatically, systematically, and at scale. One of the really nice things there is that we can get a lot of unbiased estimates of treatment effects among subpopulations, and then we can try to connect the dots: to understand what sorts of things are affecting whom, and, in the context of treatments, what sorts of treatments are actually effective for which subpopulations. I think this is a really important area, and it's one we're actively working on.

So I think it's time for me to wrap up. How can machine learning assist public health? The answer is: in a lot of ways, ranging from early disease outbreak detection, to helping them target interventions that might actually have positive impacts in reducing the opioid crisis, to thinking about environmental health, and finally to opening their eyes to the vast set of events that might be occurring in their data that they might need to respond to and are not already aware of. [00:50:11] So I'll stop here. Thank you so much for listening, and I'm happy to take any questions.

>> Thank you. Thanks for your interesting talk. A lot of what you were able to accomplish relied on organizations and cities providing datasets to you, data on sensitive individuals. [00:50:40] Can you comment on how to make this data more open to different researchers while also respecting privacy?

>> Yeah, that's a great question. The question is about the tradeoffs between privacy and utility in these sorts of public health data, and there are a few things I'd say. One is that a lot of what we do can actually be done with aggregate data; in those cases, we don't need individual-level data at all, because what we actually need for the analysis is essentially count data. There are other examples where we do need individual-level data, but we don't need identified data: we don't need to know exact addresses, maybe we just need something like the zip code level. And you can see the tradeoffs here, because in some cases an outbreak might affect a particular subpopulation, and actually knowing some of those individual characteristics does help us detect the outbreak both earlier and more precisely. [00:51:46] So I think the answer depends a lot on the data and what it represents. In some cases we're using open data; in other cases we're using data governed by very restrictive data use agreements that make sure we keep the data secure and private, and those are much harder to open up. We'd love to share data with everyone, but because of what this data represents, we can't.
[00:52:16] Now, one thing that we can do is produce synthetic datasets that have a lot of the properties of the real datasets. That helps in terms of developing methods, but it doesn't help as much in terms of actually answering some of the fundamental public health questions that the field is addressing. So all this is to say that it's a very hard problem, and we don't necessarily have all the answers. [00:52:43]

>> You showed some very nice screenshots toward the end of the talk. I was curious how close you are to some kind of deployable system for public health agencies.

>> Yes. At least for this tool, for pre-syndromic surveillance, we have actually produced a tool that is now essentially being evaluated and beta tested by New York City's Department of Health and Mental Hygiene. The hope is that we can work out the kinks with them, and once we have something that's fit for public consumption, we can open-source it and provide it to other public health agencies. [00:53:25]

>> OK, I'm going to ask one. This is a very timely question for me, as somebody who works on gerrymandering. When you're recognizing something, how do you get people in policy to recognize the value of your work and understand it, so that they're willing to make the right choices? [00:53:49]

>> Also a really hard question. A lot of our work has involved working very closely with particular organizations: public health departments, law enforcement, in some cases city leadership, and a lot of the problems we address really are operational questions. There's an understanding that public health wants to identify emerging outbreaks, and we give them a tool to help them do that better. [00:54:18] In the city of Chicago, we actually have a tool that's still operational, predicting and preventing rat infestations; again, it's just a question of, we have the resources, where should we deploy them? In a way, those are easier questions to answer. A harder one is when we actually want to influence policy: say we've realized that a particular sort of environmental exposure is having a big impact on health, and nobody's currently doing anything about it. What do we do? [00:54:47] That one's a little trickier. One challenge is finding the right contacts: of the city agencies we're working with, is there one we can just go to and say, help us with this? Sometimes there is, sometimes there isn't. If not, sometimes you have to take an alternate route, which is an advocacy route, and that might mean connecting with public sector organizations that care about these specific topics. For example, we've been doing a lot of work on thinking about how we can make policing more equitable, and so we're trying to connect with public sector organizations that care about these same sorts of topics, hopefully leading to work with police departments and city leadership as well, where we can actually achieve some of these goals. [00:55:36]

>> Thank you. Please help me thank Daniel again.