Hi, I'm Dafna, and today I'm going to be talking about my recent project, the "Aha!" moment: from data to insights. The thing is, there was a time, not so long ago, when getting your hands on good data was really, really difficult, like for this poor interviewer here who had to ride a horse across the country asking farmers how many cows they have. Of course, today it's a completely different story: today you have the cows actually telling you themselves where they are and what they've been up to. Getting your hands on data is no longer much of a problem; if anything, we have more data than we know what to do with. And this is actually great news for everybody, because vast data has the potential to transform almost every aspect of life you can think of: science, business, sports, politics, public health. Really, it has the potential to affect almost every one of society's most pressing problems. The thing is, to realize this potential it's not enough to collect data and store it, and it's not enough to be able to access it. You actually have to be able to understand it, to make sense of it, to turn all those heaps of data into insights, into actionable items.
So this is where I come in. My goal is to develop computational approaches for solving this problem of turning data into insight, and these are some of the research questions I'm most curious about. First of all, what is insight anyway? Given lots of data, how do you help people understand it: what is its structure, and what are the interesting bits and pieces? And along with that, how do you build tools to discover it? Let me just show you how this works in the domain of news. Here's a scenario: suppose you're trying to follow some really complex news story, like the Greek debt crisis or a presidential campaign.
So what do you do? Most of us would go to a search engine. We all love search engines; search engines are great at retrieving the bits and pieces, but they don't show you how those fifty-seven million results all fit together. There is absolutely no structure. There has been plenty of research on incorporating structure into search engines and on summarizing news, but most of it just boils down to a timeline, like this timeline about the Greek debt crisis. I'm going to claim that this style of summarization really works only for simple stories that are linear by nature, and real stories are nothing like linear. If I had to draw a picture of the debt crisis, it's not one line; it looks kind of like this. So what do you do when you're trying to understand these spaghetti stories?
Let me show you our holy grail. How many of you have seen issue maps before? OK, that's kind of what I was expecting. This is a set of seven posters about the great AI debate, about whether computers can think. They map this space with a graph where each node is an argument, like "machines can't have emotions", and you have edges saying that this argument is supported by that argument and disputed by the other one. You're supposed to look at this and just understand the big picture. And this is everything I've ever loved. I was just standing there in a corridor at CMU, studying this for an hour, reading the whole thing, and it finally clicked in my brain. Everything made sense. So I immediately started jumping up and down: how did they make those beautiful creatures? It turned out it was a manual effort; it took about twenty man-years. So my next question was: OK, great, how do you build those things automatically?
So let me show you what we did. We are not yet at such complex creatures, so we're going slightly simpler. The system I propose is called Metro Maps. Your input is a set of documents; you can think of them as the result of a query, for example documents about the Greek debt crisis.
As output, you take these documents and you summarize and visualize them as a metro map. Each metro line is a coherent storyline, and each metro stop is a news article. This line tells the story of how Greek bonds were rated junk and how Greece had to come up with an austerity plan to get the bailout. Other lines focus on different aspects of the story; for example, this one is about the protests and strikes in Greece because of the austerity plans. The lines actually intersect at the article about austerity, which was both the trigger for the strikes and the precondition for the bailout. You might have another line about Germany, and so on. Just like before, you're supposed to be able to get both the temporal dynamics and the structure: the major storylines and how everything connects.
Good. So how do you go about finding those good maps? It's really hard, because you don't know what you're looking for. If I show you a good map, then you know it's a good map, but it's a very intuitive property. So half of what I do is, first of all, trying to understand at an intuitive level what the properties of a good map are, for example that each story is coherent. Then, great, how do you formalize this mathematically? You come up with an objective function that actually captures my intuitive idea of coherence. And finally, once you have an objective, how do you optimize it? Everything here is NP-hard, so you go about optimizing it with approximate solutions. This is pretty much what the research looks like: intuitively understand what makes a good map, formalize, optimize. So let's talk about it.
First, forget about the math for a second and just think about what makes a good map. Well, I already gave you the first property for free: each storyline is coherent. Great, so how do you define coherence?
We had an entire paper on coherence at KDD 2010, where we were trying to see, given a chain of articles, how you score the coherence of the chain. I'll just walk you through the high-level idea. The idea was that you can't just look at neighboring articles, at consecutive transitions along the chain. Coherence is not a local property; there has to be something global.
Let me show you what I mean by "it's not local". This is a diagram of documents and words: a bar means that the word on the left appears in the article above it. Suppose you're trying to build a coherent chain. You start from this article about the Greek debt crisis, and the second thing you find is probably an article similar to it, for example what Republicans think about the debt crisis. Now, if you completely forget about the first article and just try to find a third one similar to number two, you might come up with this one, what the Pope thinks about Republicans. You just keep on drifting farther and farther away from your original theme, and you get a staircase behavior in the diagram, a stream of consciousness. Each transition is strong, but the chain as a whole is not coherent.
OK, let's try this again. Same first article about the Greek debt crisis, same second article about what Republicans think about the debt crisis, but this time you remember the context. You remember where you came from, and you know that Republicans are not your focus, so you keep on finding things that are more and more similar to the Greek debt crisis story. You end up with much smoother and nicer transitions, and also a small number of words captures the entire story. That is our intuition, before formalizing coherence.
So let me tell you how this works. If we want a coherent chain, the first thing we want is that all transitions are strong, which means we need to define a score for a transition.
The first thing we tried was this: for the transition between documents d_i and d_{i+1}, just sum over all the words, and if the two documents share a word, score a point. That's just the number of shared words. This is way, way too coarse. First of all, some words are more important than others. Also, words are too noisy: one document may mention lawyers and judges but not juries, and maybe the connection is still kind of there. So you have to replace this with a notion we call the influence of word w on the transition between documents d_i and d_{i+1}. What you need to know is that influence is high if the two documents are related and w plays an important role in what makes them related, even if it doesn't appear in either of them.
So this is the score for a single transition. If you want all of the transitions to be strong, you really care about the weakest link: you want to make the weakest link as strong as possible. That takes care of the strength of transitions, but remember what I told you a couple of slides ago about the staircase behavior. There should be some coherence, some global thing, going through the entire chain. The second observation is that there is this global thing: there is a small number of words that captures the story. So what we do is really turn this into an optimization game. Suppose, for example, I tell you that you have a budget: I allow you to pick three words out of here, and I'm going to pretend that these are the only words that exist in the documents. If you pick these three, then in your weakest link there is nothing linking the documents, so you score zero. But if you end up picking those words instead, your weakest link is much, much stronger.
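The budgeted weakest-link game just described can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions: it counts shared words instead of the influence notion from the talk, it brute-forces the word choice rather than using the linear programming formulation, and the function names and toy documents are made up for the example.

```python
from itertools import combinations


def weakest_link(chain, active_words):
    """Score of the weakest transition when only `active_words` are counted.

    `chain` is a list of documents (at least two), each given as a set of words.
    A transition scores one point per active word shared by both documents.
    """
    return min(
        len(set(a) & set(b) & set(active_words))
        for a, b in zip(chain, chain[1:])
    )


def best_activation(chain, budget):
    """Brute-force the set of at most `budget` words that maximizes the weakest link."""
    vocabulary = sorted(set().union(*map(set, chain)))
    size = min(budget, len(vocabulary))
    best = max(
        combinations(vocabulary, size),
        key=lambda words: weakest_link(chain, words),
    )
    return set(best), weakest_link(chain, best)


# Toy chains (hypothetical data): one drifts, one stays on topic.
drifting = [
    {"greece", "debt", "crisis"},
    {"debt", "crisis", "republicans"},
    {"republicans", "pope"},
]
focused = [
    {"greece", "debt", "crisis"},
    {"debt", "crisis", "republicans"},
    {"greece", "debt", "crisis", "bailout"},
]

print(weakest_link(drifting, {"greece", "debt", "crisis"}))
print(best_activation(drifting, 2))
print(best_activation(focused, 2))
```

With a budget of two words, the focused chain can keep both of its transitions strong using the same words, while the drifting chain cannot, which is the intuition behind requiring a small set of words to carry the whole chain.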
It's about finding this small number of words that captures a story. So instead of summing over all the words, we sum only over the words that we call active. It's really a problem of finding activation patterns, subject to some constraints that try to imitate the behavior of coherent chains: a small number of words, that kind of thing. And that is how we define our notion of coherence. Again, we solve this by linear programming and then applying rounding.
So we're done with the first property, coherence, and the results should be great. So are we done? Can I just pick the top three coherent chains and call it a map? Let me show you what actually happens when you do this. This is the query "Greek debt", and those are the top three coherent chains that we had. One of them is about the Asian markets, and two of them are about strikes and riots. So what's wrong? Well, first, you have a budget of three chains, and the Asian markets are really not the most important thing happening here; there are far more important things, like Germany's role. Also, the bottom two lines are really redundant; there is no reason for the map to include both of them. So the challenge is to balance coherence with what I call coverage: the map should cover diverse topics that are important to the user.
Let me tell you how this works. I keep calling it coverage, so the first thing we have is the elements that we want to cover; in this case you can think of them as words, like "Obama" and "China". We also have a probability that each document covers each word; it can be based, for example, on TF-IDF. And since I want to cover the important things,
I also have the importance of each word: how much I care about it. If you don't know anything, this can be based on frequency: everybody is talking about Obama, so he might be important. This is also a perfect place to plug in personalization, if you happen to know, for example, that you care about sports more than politics. We said that our coverage function should be about important things and should be diverse. If you try to formalize this idea of diversity, it really boils down to diminishing returns, so let me just show you what it looks like.
These are the words in our corpus; the size corresponds to frequency, and in this case to importance. And documents cover words. So this document... can you actually see the color there? I think it's flickering. This blue document covers "Obama" somewhat, another word somewhat, and "U.S." somewhat, and then you might have another document that mostly covers "New York" and "U.S.". This really pushes you to look for new documents for the map that cover important words that haven't been covered yet, so you might go and find this green document that's about Gaza and Israel.
Mathematically, each document in the map tries to cover each word with some probability; it just flips a coin. Then this is the probability that at least one of the map's documents succeeded in covering the word, if you assume independence. So this is how much the map managed to cover a word. Then you sum over all of the words and weight them by how much you want them covered. So basically it's a sum over words of how much they are covered times how much you care about covering them, and this is coverage.
Good. So now we have two things, coherence and coverage. How do they play together? Hopefully the Asian markets example managed to convince you of one thing about coherence.
You don't necessarily want the most coherent chains in your map. Coherence is rather more like a constraint: a chain is either coherent enough to appear in your map or it's not. Coverage is the thing that you're really trying to maximize. So it's all about finding a coherent map that achieves the maximum possible coverage, subject to some size constraints. If you go and optimize this objective, you get these three lines: one about strikes, one about Germany, and one about the IMF. They really are about important things, and they are diverse.
So what's wrong? Oh, come on, you're all thinking it: it's not a map. It's just a set of disconnected lines. It's especially frustrating when you see this bottom document, which is about Germany and the IMF and really should have been connected to the red line, but it's not. So our last property is connectivity: if two chains are related, I want the map to reflect that. There are many, many ways to formalize this notion of connectivity, and we did some user studies to see which ones people actually care about and can tell apart. People got really upset if they knew the story and two lines that they knew were related were not connected in the map, but they didn't seem to care much about how the lines were connected: at the beginning, at the end, at one point, at multiple points. So we started with a very simple objective that just encourages intersections. Every time two lines intersect, you score a point. Nice and easy: just look at pairs of lines, and if they intersect, you score a point. Later we went on to much more complex things, but this was our first shot.
So this is what we have: coherence is this linear programming problem, coverage is a nice submodular function, and connectivity is this very simple line-intersection objective. So how do those play together?
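As a concrete illustration of the last two objectives, here is a minimal Python sketch of the coin-flip coverage function and the line-intersection connectivity score as described in the talk. The data layout (`cover` as a dict of per-document word probabilities, `importance` as word weights, lines as lists of document ids) and all the numbers are assumptions made for the example, not the original implementation.

```python
from itertools import combinations
from math import prod


def word_coverage(map_docs, cover, word):
    """Probability that at least one document in the map covers `word`,
    assuming each document succeeds independently with probability cover[doc][word]."""
    return 1.0 - prod(1.0 - cover[doc].get(word, 0.0) for doc in map_docs)


def map_coverage(map_docs, cover, importance):
    """Importance-weighted sum of per-word coverage."""
    return sum(
        weight * word_coverage(map_docs, cover, word)
        for word, weight in importance.items()
    )


def connectivity(lines):
    """Number of pairs of lines that share at least one document."""
    return sum(1 for a, b in combinations(lines, 2) if set(a) & set(b))


# Hypothetical toy data.
cover = {
    "d1": {"obama": 0.5, "us": 0.5},
    "d2": {"us": 0.5},
    "d3": {"gaza": 0.8},
}
importance = {"obama": 1.0, "us": 1.0, "gaza": 1.0}

base = map_coverage(["d1"], cover, importance)
gain_redundant = map_coverage(["d1", "d2"], cover, importance) - base
gain_new_topic = map_coverage(["d1", "d3"], cover, importance) - base
print(base, gain_redundant, gain_new_topic)

print(connectivity([["a", "b", "c"], ["c", "d"], ["e", "f"]]))
```

In this toy data, adding a document about an already-covered word gains less than adding a document about an uncovered one, which is the diminishing-returns behavior that makes the function reward diversity.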
Coherence is still a constraint. With coverage and connectivity, you really want both. The thing is, if I show you a map that is super well connected but is about things you couldn't care less about, it's of no use to you, whereas a map that covers what you care about but is less connected, you probably still care about. So coverage is really our primary objective and connectivity is our secondary one. In other words, we look at all coherent maps that achieve the maximum possible coverage, and out of those we want to find the most connected one. This is a lexicographic optimization: the first term is infinitely more important than the second.
Good. So I'll just show you a one-slide overview of how we go about solving this problem. We start from a set of documents; again, you can think about it as the result of a query. As I said, coherence is a constraint, so ideally I'd like to enumerate all possible coherent chains that could perhaps be metro lines. But that's clearly infeasible, so what we do is encode them in a structure we call the coherence graph. Each node is a short coherent chain, and an edge means that you can concatenate two chains and they will remain coherent. It's a transitive property, so paths in this graph correspond to longer and longer coherent chains. The next thing you say is: OK, now I just need to find some paths in this graph to use as my metro lines. So I really want to pick a set of paths such that the underlying documents maximize coverage. Paths from this graph are going to be coherent, and we pick them to maximize coverage. Luckily for us, since our coverage function is submodular, this is exactly the setting of a known problem, so we use a very nice algorithm by Chekuri and Pál. It's recursive greedy and has an approximation guarantee. At this point we have a high-coverage map that is coherent, and the last thing we do is a local search step that tries to increase connectivity without sacrificing coverage. Let me show you what it looks like.
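As a brief aside before the map itself, the way the three properties combine (coherence as a constraint, coverage first, connectivity as a tie-breaker) can be sketched as follows. This is a toy illustration only: it assumes a small list of candidate maps with precomputed scores, whereas the talk explains that enumerating candidates is infeasible in practice and describes a coherence graph, a recursive greedy algorithm, and local search instead. The field names and numbers are hypothetical.

```python
def pick_map(candidates, min_coherence):
    """Lexicographic choice among candidate maps.

    Each candidate is a dict with precomputed scores:
      - "min_line_coherence": coherence of its least coherent line
      - "coverage": primary objective
      - "connectivity": secondary objective, used only to break coverage ties
    Returns None if no candidate satisfies the coherence constraint.
    """
    feasible = [c for c in candidates if c["min_line_coherence"] >= min_coherence]
    if not feasible:
        return None
    # Tuples compare element by element, so coverage always dominates connectivity.
    return max(feasible, key=lambda c: (c["coverage"], c["connectivity"]))


# Hypothetical candidates.
candidates = [
    {"name": "A", "min_line_coherence": 0.9, "coverage": 0.5, "connectivity": 3},
    {"name": "B", "min_line_coherence": 0.6, "coverage": 0.8, "connectivity": 0},
    {"name": "C", "min_line_coherence": 0.6, "coverage": 0.8, "connectivity": 2},
    {"name": "D", "min_line_coherence": 0.2, "coverage": 0.95, "connectivity": 5},
]

print(pick_map(candidates, min_coherence=0.5)["name"])
```

Candidate D is rejected by the coherence constraint despite its scores, A loses on coverage despite being the most connected feasible map, and connectivity only decides between B and C.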
So we started from this oversimplified Greek debt crisis map. The real thing has a line about how Greece struggles to stay afloat and needs help, and whether the help is good enough; another line about strikes and riots; another line about Germany; and one more tiny line.
And that's fine. You look at it and say, well, OK, it's a cute image, but is it any good? How do you evaluate this thing? This is challenging because we don't have any ground truth or gold standard. You can use all those machine learning and data mining types of surrogate evaluation, but I actually think that for this kind of problem, to know if maps are good, a user study is fundamental. You have to see if it's actually useful for anybody. So here's our study question: can those maps help people, help news readers, actually understand a news story better than current tools? We had New York Times articles from 2008 to 2010, and we picked three stories on which we wanted to see if maps are any good: the miners trapped in Chile, the earthquake in Haiti, and my favorite, the Greek debt crisis.
So let me tell you about the first study we did. We just tried to see if maps are good for simple question answering. We asked people ten questions, like "how many miners were trapped?", and we measured how much they knew before looking at anything and after looking at one of the tools, which was either our map, Google News, or something called topic detection and tracking. We had about three hundred and forty users, and, well, we were doing better than the competitors, but nothing to write home about. Nothing major. Then I talked to people, and they said: you know, if I wanted to know the name of the Greek prime minister, I would Google it. Going to a map for that is really overkill. And they were right. I think that maps are not about the small things, about this Ctrl-F kind of question. They're about the big picture.
So here's the second thing we tried. Basically, we tried to see if maps give people a high-level understanding. So how do you know if somebody has a high-level understanding of a topic? If you were ever a teacher in a class, you can tell that the only way to know if you really understand something is whether you can explain it to somebody else. So what we did is ask people to look at maps or at Google and write one paragraph explaining the story to somebody who has absolutely no clue. Then we took these paragraphs and ran them on Mechanical Turk: double-blind pairwise comparisons, asking people which one does a better job explaining the story. The results were way, way better. We had fifteen paragraphs; that's about three hundred evaluations. Especially for complex stories we had much bigger gains. For Greece, for example, seventy-two percent of people preferred paragraphs written by map users. Now, that was great for Greece; for Haiti we were a bit under sixty percent, and I was curious about why. So I looked at the actual paragraphs and at the map. The Haiti map has one major storyline, about the earthquake, the damages, and the relief efforts, and a bunch of smaller stories about, what was it, Wyclef Jean running for the presidency, and missionaries accused of kidnapping children, that kind of thing. People who summarized the story just focused on the major storyline and ignored everything else. So I think the bottom line is that maps are most useful as high-level summaries for stories that don't have a single dominant storyline.
So those are the stories we started from; that was metro maps demonstrated in the news domain. But now I actually want to convince you that it works in a whole bunch of other domains. My point is that maps are really, really easy to adapt, because the underlying ideas of coherence, coverage, and connectivity remain the same; it's just that you might be able to use domain knowledge to come up with a slightly smarter objective. I'm going to show you three examples: science, legal documents, and books.
Let's start with science. The idea of this project was basically to see if maps can help people, for example first-year grad students, understand the state of the art of a field, in this case reinforcement learning. We had lots and lots of ACM papers, and there were slight modifications to the objective, especially taking advantage of the citation graph, but the algorithm basically stayed the same. So let me just show you an example. This is the map about reinforcement learning. I'm not actually expecting you to read it; just look at the major research threads we found: one about multi-agent cooperative learning, one about MDPs and POMDPs, the green one about controlling robotic arms, one about bandits and the exploration-exploitation dilemma, and finally this line about theoretical bounds. As I said, the map is connected, but there are these funny dashed things between the lines. That's because I understand connectivity differently here: even if lines don't intersect, if they have a lot of citations between them, like the theory line and the applications line, I still count it as connectivity.
[Audience question, inaudible.] So what I want is that each line is coherent. Yes, I still score connectivity. I don't need the entire thing to be connected; I just want it to be as connected as possible. If you start from a super broad query, it's not going to be connected. If you start from "Obama", you're going to have a whole bunch of different things that he did. I'm just trying to get as connected as I can, and by the way, connectivity works against coverage for exactly this reason. Let me go on and show a little bit more about this.
You can see, for example, how the theory line had an impact on the bandits and exploration-exploitation line. And you can see how the MDP line had an impact both on the cooperative multi-agent line and on, what was it, the robotic arms line. OK, good.
The way we evaluated this was actually really fun. The goal was basically to see if maps can help a first-year grad student. So we got a whole bunch of people and told them: OK, now pretend to be a first-year grad student who is embarking on a reinforcement learning project. You go to your professor's office all excited, "yes, teach me everything you know", and the professor gives you a survey paper: "OK, go read it and come back when you understand every word." And the survey is from 1996. So their goal was really to identify what has happened since then: the major research directions and the relevant papers. They got either Google Scholar, or our map plus Google Scholar, because the maps are kind of a starting point. I don't expect a map to capture the whole thing, and I do want people to go and check that a paper is good and so on. We also had a baseline of the map by itself; it's not on the slide, but it was still doing pretty well.
We had thirty people, and we basically combined all the papers that anybody mentioned into one huge list and had an expert grade it. We checked precision, meaning whether each paper is good or bad, and something for which maybe the right term is subtopic recall: the expert made a list of the top ten subareas of reinforcement learning, and we wanted to see how many of them people actually managed to find. The results in a nutshell: on average, for map users, a higher percentage of the papers they found were deemed relevant, and out of the top ten subareas they found about three more on average. We were pretty happy at this point.
And again, just to demonstrate the usefulness of maps, we actually made a metro map of our related work at this point.
This blue line is basically this work, from Connecting the Dots to Metro Maps to metro maps of different domains. I'm not going to go into detail, just show you the general areas. There has been a lot of related work on summarization, and in particular on summarizing news stories. There has been a lot of work on narrative and storytelling, and some work on coverage notions. This pink line is basically about visualization, from concept maps to mind maps, and the red line is about visualizing science. This is just to give you, again, a high-level overview of who is around us.
So that was the science domain. The next thing we did was legal documents. There is a legal search engine company in Palo Alto that read the paper and came knocking on our door: "Hey, can we incorporate this?" The goal is really to help lawyers argue a case, and the data they gave us were Supreme Court decisions. The first problem we ran into: how many of you have seen Supreme Court decisions? OK, excellent. So they're insanely long; sometimes they're hundreds of pages, so with TF-IDF you're just not getting any signal. But luckily, what we found out is that when you cite a case, you actually have to say why you're citing it: "in blah v. blah, they defined the notion of something". So if we just use this anchor text in the citation graph to pinpoint the interesting point in the long document, then everything else falls beautifully into place.
Again, this is still the beginning, but just to show you, this is one of the maps we actually computed for them, for the Commerce Clause. You can see, for example, that the purple line, I guess, is about who can waive state immunity, and whether, if you work for a state-owned company, you can sue them. OK, how about the same thing in federal court? And there's this test, Section, I don't even know, 1983: does it apply to the previous case or not?
The lawyers said this made perfect sense. And just to get a reality check, we asked them to label a few of the lines, and then we wrote down the words that made those lines coherent for our system. You can see, for example, line number three, labeled "Eleventh Amendment, state sovereignty", and the words we had were "immunity", "sovereignty", "amendment", and "eleventh". Or the last line, "regulating wholesale energy sales": we had "wholesale", "electricity", "resale", "steam", and "retail". I wasn't so sure about "steam", but OK. So this is a reality check. We're working with them, and they want to incorporate it. That was the legal domain.
The last one is books, and again, this was just for the fun of it. We wanted to see if you can apply maps to books with really complex stories, to understand the structure of a book. And when I say book, I mean The Lord of the Rings, mostly because I refuse to read A Song of Ice and Fire until he finishes writing it. What we found right away is that my notion of coherence broke completely. News articles are nice enough to tell you what happened earlier, but books don't really work this way. They don't tell you "now we're done with that, so we can go on and do this"; they just go on and do it. So we didn't stay with the same setting. Characters and their points of view are critical here, so we focus a lot more on named entities in order to understand which parts are coherent.
Just to briefly show an example: you can see the structure of The Lord of the Rings. You have Gandalf and company starting on their merry way, and then they meet a lot of other people at the Council of Elrond, and they split up. You see Merry and Pippin here, and Sam and Frodo going to Mordor; surely the bad guys are down there, and eventually they're going to meet. So it's already showing you some interesting structure.
OK, enough of that. This is some follow-up work we did on maps.
The first thing I did at Stanford was scalability. I was worried about it, and I wanted to handle a web-scale corpus, not just the New York Times. So we redid the algorithms and came up with a hierarchical version of maps, and from eleven minutes on seven thousand articles we went to roughly thirty seconds per query on hundreds of thousands of articles. We're happy.
The second thing I did was interaction, and I think this is one of the things that is actually going to make or break metro maps, because I'm not going to nail the map you have in mind based on a couple of keywords. We played with two interaction mechanisms. One of them is a multi-resolution idea, where you zoom in to learn more about something you care about and zoom out to get a high-level overview. The second is word-based feedback. Remember the coverage weights, how much you care about each word? I can actually show them in a tag cloud and let people tell me where I'm wrong, like "I don't actually care about Germany", and then we go with the personalized notion of coverage.
One more thing: I've been working with a student this semester on a really fun project of computing maps based on different points of view on a controversial topic. You type in "healthcare" and you get one Democratic map and one Republican map: how do they view the story differently? It's a really fun project; we've been working on modeling this notion of controversy, and on how to represent documents based on their sentiment towards the controversial words.
That's about it. Another thing we have is a website on the way; it's in the final stages of debugging. And I have a student working on making this an open-source package, so you can basically plug in your data, make it conform to our format, and see what comes up. This has been an exciting semester.
To recap metro maps.
The point of this project was to take, you know, a news reader, a grad student, a paralegal, or really anybody who has a lot of data and tries to understand its structure. People who used to rely on search can now get a perspective of the field and see the structure and the connections. We formalized the properties of metro maps, coherence, coverage, and connectivity, and we came up with an algorithm to optimize them, followed by user studies to validate.
At this point I was just sitting there thinking: OK, so what's the next step? The thing I like the least about metro maps is that they really only show connections that other people have made before: what journalists found out and wrote about. But what if you want to discover some new connections? This is how the "Aha!" project came to life. Again, this is about finding, given lots of data, the insightful connections, the most interesting bits and pieces. And just like before, the challenge is: great, how do you formalize this notion of insight? Just a word of caution: this is still a work in progress. It's a lot less mature than metro maps, but I'm really excited about it, so I thought I'd tell you about it.
So how do you define insight? There's a lot of work out there to read; you can actually argue that the entire KDD community is about turning data into insight. I tried to read this work and abstract away what is common across all of it. I think the first property of insight, which might be trivial, is that it has to be surprising, because if you know about it already, then nobody cares. But surprise is not enough, because given enough data you'll find plenty of things that will surprise you, mostly because of bias or coincidence. An insight also has to be plausible, in the sense of being really well supported by your data. Again, a very simple idea, very general, and I'm going to show you how it actually works, how you formalize these things.
I'll do it in the context of medical insight, because medicine is the perfect domain for this. Tons and tons of data, they have more data than they know what to do with, and there's the potential for many, many new links to be discovered. Every day you see that researchers have found a link between this and that, but our goal is really to help researchers come up with gaps in medical knowledge, to identify those promising research directions. So let's look at plausibility and surprise in the context of medicine, and just for the sake of this presentation I'm going to restrict myself to a very limited type of insight: basically a pair of medical terms, a connection. For example, sleep apnea and diabetes: is there a connection between them? So how does it work? What makes a connection plausible? The two terms have to co-occur a lot in practice, in hospitals. And it's surprising if nobody in the literature ever noticed this. So in order to actually make this work, this is the data we had. To see what happens in practice, we've got seventeen years of medical notes, natural-language medical notes from the Stanford hospital, about ten million notes. And for surprise we downloaded Medline, which is roughly eleven million papers. Now let me just show you an overview of the system we have. So we have a search engine. You technically don't have to start from a query, but it seems like medical researchers usually have something they care about immensely, so we made a search engine. Say you start from sleep apnea. The first thing you do is go to the medical notes and say, well, what co-occurs a lot in sleep apnea patients? What are the plausible candidates? The next thing you do is take this pool of candidates and sort them according to how much they surprise you, according to Medline. There are three pieces you have to think about here. First of all, where did the terms come from?
How do you formalize plausibility? How do you formalize surprise? So, to get the terms, we extracted lots and lots of medical terms from those notes and from Medline. And the first problem we had was granularity. It's natural language text, so people were writing many different variants of sleep apnea, and we'd be asking: can I merge them? Should I merge them? So we started by using medical hierarchies. We have this kind of graph: migraine disorders has two children, common migraine and a not-so-common migraine. And we wanted to know if we can propagate things up. The tool we used was KL divergence: how much information is lost when you use the parent to approximate the child. So really what you see here is that common migraine can propagate up to migraine disorders, maybe even to vascular headaches, depending on your threshold. But if you see this other type of migraine, you don't propagate it up; this is something different. And this is where the terms came from. This made our results much better, because before this, all those different ways of referring to the same thing were just completely screwing up my counts. Now we had a much nicer granularity. Next, how do you formalize surprise? What makes a connection surprising? The first thing, I told you, is that the terms don't appear together in the literature, in Medline. Really, it's Medline and it's noisy, and they might appear together by chance, so we ask that they appear together very, very rarely, under K times. So that's the first thing, the filter. It's rough, though, because if this is a disease that, like, three people in the world have, it's not that surprising that the terms don't appear together in Medline. It's actually surprising if, you know, common flu and diabetes have something in common, and plenty of people are researching them, and nobody noticed it.
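Going back to the hierarchy step for a moment, the propagate-up test can be sketched in Python as follows. This is a minimal illustration under assumptions the talk does not state: each term is represented by counts over some context features, the parent is the pooled counts of its children, and a child is merged into the parent when the KL divergence from child to parent is under a threshold. The names `children_to_merge` and `threshold` are hypothetical.

```python
import math
from collections import Counter


def normalize(counts):
    """Turn raw counts into a probability distribution."""
    total = sum(counts.values())
    return {key: value / total for key, value in counts.items()}


def kl_divergence(p, q):
    """KL(p || q) in nats; assumes q[key] > 0 wherever p[key] > 0."""
    return sum(pv * math.log(pv / q[key]) for key, pv in p.items() if pv > 0)


def children_to_merge(children, threshold):
    """For each child term, decide whether the parent approximates it well.

    children maps child name -> context counts (an assumed representation).
    The parent distribution is built by pooling all children, so it is
    nonzero wherever any child is nonzero.
    """
    parent_counts = Counter()
    for counts in children.values():
        parent_counts.update(counts)
    parent = normalize(parent_counts)
    return {
        name: kl_divergence(normalize(counts), parent) <= threshold
        for name, counts in children.items()
    }
```

A frequent child whose contexts dominate the pooled parent loses little information and is merged, while a rare child with very different contexts has a large divergence and stays separate. Where exactly to set the threshold is a judgment call, as the talk notes.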
So the surprise score is really this rarity filter times the minimum of the two terms' importance, and it's the same idea of importance as in the coverage function, because you really want both terms to be important and never discussed together. You can think of it as novelty and utility. Good. So this is surprise. Let's talk about plausibility. The first idea for plausibility is actually exactly the opposite of surprise: you want the two things to co-occur a lot. So what we did is we took all the notes that a patient received in a single year and made them into one giant document. Then we basically computed the Jaccard coefficient between the two terms: how many patients had both over how many patients had either. Which is a way of saying what co-occurs with sleep apnea in patients. And just to show an example of what happens when we ran with this objective: we started from dementia, and the output was those six terms with the highest Jaccard coefficient. The first three are Alzheimer's and two medications used to treat dementia patients, which is OK; they're going to be filtered away by the surprise filter because they co-occur a lot in Medline. And then you're left with hip fractures, atrial fibrillation, and wheelchairs. At this point it gets slightly suspicious, because you have those hip fractures and wheelchairs. Well, is it actually dementia, or is it just because the dementia population tends to be old? It might not be dementia. So the second thing I want is explanatory power. I'm cautious about using the word causality here, but I actually want to make sure that dementia is the thing that we're after. What we do is find a control group that's very, very similar to the dementia patients, using propensity score matching. Those people don't have dementia, but other than that they're very similar. Then you ask: are hip fractures a lot more common for dementia patients than for the control group? If not, then OK, we're not finding the thing we're after.
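The two scores described above, plausibility as a Jaccard coefficient over patients and surprise as a rarity filter times the minimum importance, can be sketched as follows. This is only an illustration: the data layout (`patient_terms`, `co_mentions`, `importance`), the cutoff `min_jaccard`, and the function names are assumptions, and the propensity-matching check is left out entirely.

```python
def jaccard(patient_terms, a, b):
    """Patients with both terms over patients with either term."""
    has_a = {p for p, terms in patient_terms.items() if a in terms}
    has_b = {p for p, terms in patient_terms.items() if b in terms}
    either = has_a | has_b
    return len(has_a & has_b) / len(either) if either else 0.0


def surprise(a, b, co_mentions, importance, k):
    """Zero if the pair co-occurs k or more times in the literature,
    otherwise the smaller of the two terms' importance scores."""
    together = co_mentions.get(frozenset((a, b)), 0)
    if together >= k:
        return 0.0
    return min(importance.get(a, 0.0), importance.get(b, 0.0))


def rank_candidates(seed, patient_terms, co_mentions, importance, k, min_jaccard):
    """Keep terms that are plausible for the seed, then sort by surprise.

    patient_terms: patient id -> set of terms found in that patient's notes.
    co_mentions:   frozenset({term, term}) -> number of papers mentioning both.
    importance:    term -> importance score (how it is computed is assumed given).
    """
    vocabulary = set().union(*patient_terms.values()) - {seed}
    ranked = []
    for term in vocabulary:
        plausibility = jaccard(patient_terms, seed, term)
        score = surprise(seed, term, co_mentions, importance, k)
        if plausibility >= min_jaccard and score > 0:
            ranked.append((term, score, plausibility))
    # Most surprising first; ties broken alphabetically for determinism.
    return sorted(ranked, key=lambda row: (-row[1], row[0]))
```

On toy data, a term that co-occurs with the seed in many patients but is already well studied together with it in the literature is dropped by the surprise filter, while a term that co-occurs in patients and is rarely co-mentioned survives and is ranked by the importance of the less important term.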
OK, so what I want is really this high Jaccard coefficient, we want it to be well supported, and we want it to pass the smell test, to be reasonably sure that dementia is the right thing here. Let me show how our system actually did when we ran it with this formulation. So we start from dementia. Hip fractures and wheelchairs didn't even pass the plausibility filter this time, so we came up with those four, and all that was left after passing the surprise filter was atrial fibrillation. At this point you scratch your head, like, is that an insight? What does this mean? Ideally at this point I would have an army of physicians, and I would tell them: atrial fibrillation and dementia, go chase it down. But physicians and time and money are things I don't happen to have a lot of. So the question was, can we do early discovery? We asked physicians to tell us about recent interesting developments, medical breakthroughs, and then we rewound the data five years, so the data is truncated five years in the past. We're saying: if we had run this system five years ago, how many of those would we have told you to go chase? And if you manage to find anything, it's a really strong indication of utility. And just to show the data: it was again a tiny sample, they only gave us four examples, these four. But the best part is that out of those we actually managed to find two. They were thrilled, and they are currently working on giving me a much, much longer list. OK, so this is a fun one. And I said this is a really general idea, so I'm going to show you how this works in a completely different domain, the domain of commerce. Exactly the same formulation, exactly the same algorithm, just very different input. The main idea is to have a serendipitous product search. For example, suppose last week I wanted to buy a laundry hamper.
I don't actually need a laundry hamper; I need a place to store dirty clothes, right? So what I'm looking for in this case are products that are plausible, in the sense that they will solve the same problem for me, and we use common-sense facts from ConceptNet in order to do this. And they also need to be surprising, in the sense that when you look at Amazon's "people who viewed this also viewed this", nobody who buys a hamper ever even thinks of looking at this other product. So at best you have a search engine where, if your input is a hamper, the output is "have you considered buying a really big trash can?" I was actually talking about this with a friend of mine and he says, hey, I use a trash can for a hamper. And the algorithm is just the same again. Instead of medical notes we use ConceptNet; instead of Medline we use the Amazon "people who viewed this also viewed this" graph. But everything else is just the same. And I actually played with this system for a bit, did some shopping for an hour or so, and there were two major things I learned. First of all, baby products and pet products are surprisingly interchangeable. The second thing I learned is about home products and car products. I started from this mat in the bath, which is basically there for you not to slip and break something, and it suggested the same thing from the car department. And it also suggested buying tape, like anti-slip tape, which I didn't even know existed, but it actually made sense. So I'm having fun with this one. To recap this project: it was about taking medical researchers and giving them some promising research directions, giving them ideas. And we managed early discovery for a couple of medical breakthroughs. As I said, this is more preliminary, but they're giving us more data right now.
And I showed you that the same idea carries over to another domain, serendipitous product search. I have also applied it to Wikipedia and to something called literature-based discovery, which is basically the same thing as the medical setting without the hospital notes; it's just using Medline. OK, so this is a good time to take a step back and say, OK, we talked about two things, the metro maps project and the insight project. So what's the common ground, the underlying theme of this? First of all, the obvious answer is turning data into insight. But really I think all of this falls into a framework: start from a really, really fuzzy problem, like how do you define coherence, how do you define insight, then how do you formalize it, and how do you go about optimizing it. The tools I use in order to solve it come from, basically, data mining, machine learning, information retrieval, HCI, and lots of algorithms, especially graph algorithms. And I showed you how this actually applies to a whole bunch of fields, like medicine and science and news and legal documents and commerce. So this is what I do. Let me talk about what I want to do now. The first thing, obviously, is these projects. I really like this idea of building a set of tools that will be useful for scientists and analysts and web users, in order to get a lay of the land and find whatever is interesting in their data, and really enable new discoveries. One really important point is that I really like applying my techniques to multiple domains, and I think that if you can generalize across domains, that means you actually have something stronger. Now, these are the collaborations that I'm most excited about right now. In life sciences, of course medicine, but I really want to find some biologists for this. In social studies and political studies, we got some really nice datasets about criminals and about Congress votes.
There's a history professor at Columbia who's been excited about applying this idea to telegrams. And, I guess, prehistory: there is a group of paleontologists, and I'm not entirely sure what they want to do, but they e-mailed me, like, yes, let's talk about this. Something I've been wanting to do for a really long time is browsing history, because, you know, you want to plan a trip somewhere and often you find yourself with seventy-five open tabs in Firefox and, what just happened here? So it's kind of a way to say, OK, this is the path you went through, those are the threads. Then there's corporate data, like Boeing. I just learned that most of the Fortune fifty companies are around here. So corporate data is definitely interesting, and financial data. I've been curious about applying this insight idea to investigative journalism. You find something, you find WikiLeaks or something: how do you extract the interesting parts? And the last part actually really surprised me. I never thought this would work on anything that's not text-based, but there are a couple of researchers at UT Austin who applied this connecting-the-dots idea to summarizing videos as sequences of images. And one more thing: if you have any ideas, I've been dying to apply this idea of maps to domains that don't have a chronological component to them, because there's nothing about the method that calls for a chronological component, but I couldn't find any. So if you have ideas, please shout. OK, these are the immediate directions. Now, in the longer term, there are so many facets that are just waiting to be formalized. I talked about some of them, like insight and connectivity, but there are a lot of others I want to do, like politeness and trust and especially creativity. I actually started working on creativity this semester, so let me just give you a couple of slides of overview of what we did.
Really quickly: the goal with creativity is an inspiration generator. The scenario is that you have a product, you have a company, and you want to expand it somehow, to expand your business, maybe find another use, find another market. There is this model called the SCAMPER model, which is basically a series of questions, like: what can you combine it with? How can you put it to another use? My favorite example of this is a company that used to make shower heads with water pressure, and now they're making dental floss based on the same water pressure. It's just: what can you do with your technology? And we built a prototype system that uses the same idea of ConceptNet and Amazon, just trying to answer the SCAMPER questions. And just to show you the output of this for one example, alarm clocks: you type in alarm clock and you get suggestions like, combine it with a coffee machine, so you have a fresh cup of coffee in the morning. Or combine it with a dimmer, so the light becomes stronger and stronger until you wake up. Maybe have a silent alarm clock, maybe something that vibrates, either for deaf people or for people who don't want to wake up the other person. Again, this is at the very beginning, but I've been very, very excited about this one.
OK. To summarize the talk: we started from the idea that data can help us understand the world and make better decisions, but in order for this to work, it's not enough to store it, it's not enough to be able to search the data; you have to be able to reveal structure, you have to be able to reveal insights. And so we formalized all of this, and we came up with user studies and early discovery experiments to try to validate our ideas. And if you take one thing from this talk and forget all the rest, it's that you really need to go beyond search at this point. That's about it. Thanks.
Yes? [Audience question, largely inaudible.]
Yes, I think it's a very different problem for those domains, right, because the maps are about storylines, about some notion of progression. Now, because I've been trying to apply this to so many domains, I know people who want to buy a camera, or do research on dog breeds, and, well, there are many axes you can sort them on, but then it doesn't actually make sense to have a map. That's more of a ranking or clustering problem or something like this. So I think the strength of the map is when you're actually trying to understand that there are multiple major threads and they intersect, and the intersection actually makes sense.
Yes, actually, at one point I was trying very non-linear things. If I don't have a way of making sense of what the right timeline is, then this is not going to work. I was definitely thinking of, what do you call it, Memento. Maybe just working backwards would work.
Yes. So the thing I was talking about with Boeing was actually semi-structured. They were saying that there are lots of user forums in the company where people submit questions and other people answer them. That's kind of nice, because more structure is better, right? We did the scientific domain and we actually took advantage of all those beautiful, beautiful hyperlinks, and also in legal documents. So more structure is just going to help, but you have to understand how to make use of this extra information, and I don't know enough about how corporate data is really structured. All I know is they talked to me about having all those subdivisions that basically solve the same problem and don't talk to each other. So can you actually tell them that somewhere in the forum somebody asked a relevant question? So it's not about who; it's more about this:
You know, you're trying to solve this problem, and there's another division at the very other side of the building that solved something that would be useful for you too, but you're not even going to look at their stuff because you don't talk to each other. You don't even know they exist. It's basically about merging things together, making sense.
Yes, actually, I didn't give this part enough credit. We actually did a user study on how people interact with the map, and I realized that other people just don't use the computer the way I do. One of my first comments, from somebody who was playing with it, was about drag and drop: this is so counterintuitive. So yes, my first visualizations were nowhere near useful, and then I actually got to speaking with HCI people, at Carnegie Mellon mostly, and now I have an HCI student working for me, and she actually knows all these beautiful things about what colors to use and what font to use, when to make things show up on demand and when not. Yeah, that's fair. I didn't talk about it because I think I had forty-five minutes. But this was a big source of going back and forth in those early years.
[Audience question, partly inaudible, about how the dots are connected and about citations.]
So, in the news domain, an intersection actually meant an article that can sit comfortably in both lines, like an article about austerity plans. When you get to science, you have one line about SVMs, the theory side, and you have another line about applications to facial recognition, and you really can't find a paper that sits in both of them while they still remain coherent. So you have to rely on citations, saying, well, there is influence or impact from here to here. I don't need to compute anything; it's given, so I might as well use it. And again, if there are multiple connections I'll actually show them, because it's interesting to people.
So, a couple of slides before, you were showing how to develop new ideas for products. To be honest, it wasn't quite clear to me what was generated, because those weren't new ideas; all of those I had heard before, so it didn't seem all that new or insightful to me. But it made me think. In our intro course for new grad students, an exercise they sometimes get is: take two different areas in computer science and think about a problem that might bring those together. I think this tool would be fantastic for doing something like that, because often, particularly for people who do applications research in computing, it's not really about making a breakthrough discovery in computing itself, but a breakthrough in how you take something that computing knows about and apply it to something it hasn't been applied to. I'm preaching to the choir, if you want. So do you have experience doing that? The medical one that you showed was a little bit along those lines, but you had to seed it with a particular problem.
You don't have to seed it; you can just let it go without the query. It's just that the medical researchers didn't seem to go for the very open-ended. They preferred: if I'm a researcher of sleep apnea, tell me something I didn't know about it.
So there's a commercial opportunity for graduate students here, where you just generate a whole bunch of different thesis topics.
You know, there is a paper title generator, a random paper title generator, that does the same mix and match from things. You get some really good ones.
This often happens. So I just wondered about the optimization: instead of you crafting those objective functions, have you thought about actually trying to learn them?
The question is feedback, right? If somebody is only going to tell me a map is good or bad, I need some more expressive form of feedback to understand why, even just breaking it into subproblems: is this chain more coherent than this one? Again, those objectives are pretty much, as you said, a starting point, but I'm not married to them in any way, so I would go with learning. And about approximation: in order to get the speed-up that I talked about, we definitely relaxed a whole bunch of those objectives. This is where the speed-up came from. And when performance drops, I try to bring it back up, so there's some interplay going on. Thanks.