Thank you for this very nice introduction. This is the first time I am at Georgia Tech, and I think also the first time in Atlanta, so I'm very excited to be here. I'm going to tell you about some recent work we did last year. Basically, this talk is about how to build a system that learns to detect malware by reading and understanding security papers.

A lot of security problems involve distinguishing between malicious and benign entities, instances, or artifacts, and we use machine learning a lot for this problem. Professor Wenke Lee did some of the seminal work in this area more than fifteen years ago, and now machine learning is widely used in the security industry; I know because I actually worked there. Basically, machine learning learns from examples: it starts with a few known benign and known malicious examples (can you see the slides from that side of the room? OK, cool), and then it tries to classify the rest of the examples according to how similar they are to these initial instances. They don't have to be identical; the point is to detect new things that are somewhat similar, but not exactly the same as before. This general approach is used for detecting spam, malware, and network attacks, for predicting which vulnerabilities will be exploited in the wild, which web sites will become malicious in the near future, for predicting data breaches, and probably many other problems. In all of these problems, the fundamental question is: what does it mean for two samples to be similar?

Let me explain this more clearly with a few examples, which illustrate how important it is to determine similarity in the right way. The first two examples are from my own work; before this project we did a lot of machine learning projects for security. In one of them we tried to see if we could predict which vulnerabilities are going to be exploited in the wild by mining the Twitter stream. There are lots of efforts to predict various events using Twitter: stock market movements, movie revenues, flu trends, and so on. Typically, Twitter analytics involve analyzing the content of the tweets themselves as well as some features of the users, such as how popular or influential they are.
These features that come from Twitter itself turn out not to work well enough for predicting vulnerability exploitation; we also need to look at the characteristics of the vulnerability itself: for example, whether it is a remote code execution vulnerability, a privilege escalation vulnerability, or a denial-of-service vulnerability. This matters because attackers are more interested in a code execution vulnerability than in a denial-of-service vulnerability, for example. The CVSS score, a numeric measure of how severe the vulnerability is, also turns out to be important. So the key here is that in order to do this prediction you need some domain-specific features, the features of the vulnerability itself, that come from outside of Twitter. I put a link here; I don't plan on talking more about this paper, but if you're interested you can check the link or ask me about it later.

That's the first example. In the second example, we tried to detect malware delivery on the client side. What this means is that we look at what files are being downloaded on a host and, in particular, who downloads them. In many attacks, a piece of malware downloads additional payloads from the Internet, perhaps because that is the next stage of the exploit, or in some cases these are just generic droppers that distribute malware; that is their business, they just distribute malware. Now, if you look only at the contents of these droppers, or at their behavior, they are difficult to distinguish from, say, a software updater, which does the same thing: it goes to the Internet, downloads something, and executes it. But if you look at the relationship "file A downloads file B", you can reconstruct from these relationships the download graph on each host, and it turns out that the properties of these graphs look very different for benign download activity and for malicious download activity. So again, the trick to solving this problem was to figure out that we need to look at complex, graph-based features derived from the structure and evolution of these graphs. I'm not going to talk more about this work either; there is a link to the paper, and if you're interested, ask me after the talk and I can provide more details.

And then the third example, which I'll use as a running example in this talk, is Android malware detection. Again, the question here is how we should compare samples: what does it mean for two Android samples to be similar? The earliest Android malware detectors just looked at the permissions, and initially this worked fairly well, because Android malware needed to request certain permissions that were essential for its functionality. But these early detectors became less and less effective as malware evolved, and if you think about it, the permissions just indicate the privileges of an application, not its actual behavior. Also, if the malware performs a privilege escalation exploit, it doesn't actually need to request permissions.
So the second generation of Android malware detectors used API method calls as features for comparing samples. I gave you these three examples, but the point I'm trying to make is that in order to detect malicious activity, malware, malicious instances, you need to think very carefully about the semantics of the threat you're trying to detect, and about whether the features the detector is going to use are actually related to those semantics; that is the point of the example with the malicious behavior versus the privileges of the application.

This step is called feature engineering, and in our experience it is the most time-consuming step of any machine learning project: coming up with good features. So how do we do this in security? Well, we read papers and industry reports. There is a very large volume of information published about security; in fact, the volume is so large that it is hard to keep up with it. If you search Google Scholar for papers published on malware, you get over one hundred thousand hits; for intrusion detection, over six hundred thousand. Who is going to read all these papers? The challenge is that it becomes difficult to assimilate all the relevant knowledge that might be important for the machine learning task you're trying to implement.

So here is the dilemma: on the one hand, we have a lot of people working in security and a lot of good information about attacker behavior, about attacks and how to detect them; on the other hand, this growing body of knowledge makes it difficult to engineer good features. What we asked in this project is: can we turn this growing body of knowledge to our advantage? Can we engineer features automatically by mining all this literature? In other words, can we create an artificial intelligence that not only learns from examples, but can also help us build other intelligent systems?

To do that, we must first understand how security threats are described in these papers, so let's take a look at a few examples. This sentence says that the malware is designed to send SMS messages to certain premium numbers. A security analyst reading this sentence would conclude that this malware commits SMS fraud; that is the malicious activity. But note that this conclusion is based on common sense; it is not based on any linguistic clues that appear in the sentence. Let's look at another example. This one says that intents, such as BATTERY_CHANGED_ACTION, are used as triggers to start the malicious service in the background. Here the sentence tells you what the malware is doing, but you need some understanding of Android programming to know that intents are a mechanism that allows you to register callbacks, to start an activity when a specific event occurs. So again, really understanding what the sentence means requires some background security knowledge. And this third example says that GingerMaster is often bundled with benign applications and tries to gain root access. [Question from the audience about where these sentences appear.] This one is from a research paper published at a security conference, yes.
In this talk I'm only going to talk about analyzing research papers, so we get the PDFs and extract the text. [Question from the audience about the source of this sentence.] Yes, for this one the citation is here; this is the Drebin paper. The point is that, again, to understand the meaning of the sentence you need to know that bundling is done to make the application look like a benign application, and that gaining root access is basically a form of privilege escalation.

Understanding the semantic meaning of the sentences used to describe security threats is our first challenge. This is an instance of a broader challenge in the field of natural language processing called common-sense reasoning: we need to interpret the sentences and extract their meaning by using common sense and knowledge of the security domain. This is a pretty broad, generic challenge, one of the biggest challenges in language processing. But here we have another challenge that is more specific to the security domain, and I'm going to illustrate it with this plot, which shows the cumulative count of unique words in papers published over the years at the IEEE Security and Privacy Symposium, also known as Oakland, the flagship security research conference. How do you think this should look? I'm curious. Yes, it goes up; it's cumulative, so it has to increase. Exponential? OK. Are there any natural language processing or machine learning people in the room; what would you expect it to look like? It can't really be exponential, because you have a comparable number of papers every year, so how many new words can you introduce?

Well, it turns out the growth rate is actually constant. Basically, this means that every year a constant number of new terms are introduced in these papers, terms that we somehow need to interpret. The reason behind this is the security arms race: we come up with defenses, attackers invent new attacks and new ways to bypass the defenses, and this leads to new terms being used to describe these new attacks and these new concepts, which are important. This has an important implication for analyzing this text: we cannot rely on natural language processing techniques that try to match the text against a fixed set of concepts, a fixed set of terms. Instead, we must somehow find a way to discover open-ended malware behaviors from the language, from the text.
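As a rough illustration of the measurement behind that vocabulary plot, here is a minimal sketch of how one could compute the cumulative unique-term count per publication year. The corpus and its contents below are invented for illustration; this is not the actual processing pipeline used in the work.

```python
# Minimal sketch: cumulative vocabulary growth over publication years.
# `papers_by_year` is a hypothetical mapping from year to lists of paper text.
import re

def cumulative_vocabulary(papers_by_year):
    seen = set()
    growth = {}
    for year in sorted(papers_by_year):
        for text in papers_by_year[year]:
            seen.update(re.findall(r"[a-z]+", text.lower()))
        growth[year] = len(seen)  # unique terms seen up to and including this year
    return growth

papers_by_year = {
    2010: ["the malware requests the SEND_SMS permission to commit premium fraud"],
    2011: ["the rootkit hides its kernel module from the process list"],
    2012: ["the ransomware encrypts user files and demands payment"],
}
print(cumulative_vocabulary(papers_by_year))
# A constant year-over-year difference corresponds to the linear growth seen in the plot.
```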
Let me give you an intuition for how this might work. Like I said, we mine research papers that typically propose features for detecting malicious Android apps. These papers usually have a section that explains their feature engineering, and they say things like: we use the getDeviceId and getSubscriberId API calls as features because they allow an app to access sensitive data; we use these other two calls because they allow the app to communicate over the network; and we use Runtime.exec because it allows an app to execute external commands. Note that these are malware behaviors, more abstract malware behaviors, described using the words an analyst would use; they are described in human language. So it seems that in order to understand what these features actually do, what their meaning is, and whether they are useful for malware detection, we need to discover these malware behaviors and link them to the statements about the features.

This is what we try to do. The first thing our system does is behavior extraction. A behavior here means a brief description of malware activity, such as "access sensitive data" or "execute external command". For the purposes of this work we define it as a short phrase consisting of a subject, a verb, and an object, where either the subject or the object may be missing. We detect these patterns in natural language by analyzing the grammatical structure of sentences; specifically, we use a type of dependency parser. Let me show you how this works. Let's go back to the sentence about the malware that sends SMS messages to premium numbers. The first thing we do is part-of-speech tagging, which means we determine which word is a verb, which is a noun, which is an adjective. Then we establish the dependencies: directed links between two words that indicate a grammatical relationship between them. For example, the red edges here show the direct-object relationship between the verb and its object. We have a number of these typed dependencies, and from them we extract the subject-verb-object patterns we are looking for; from this sentence we extract five behaviors. Basically, this allows us to break down a long sentence into shorter statements that each have a single meaning; these are the behaviors we are looking for.
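To make this extraction step concrete, here is a minimal sketch of subject-verb-object extraction using spaCy's part-of-speech tagger and dependency parser. This only approximates the idea; it is not the parser or the dependency patterns used in FeatureSmith.

```python
# Minimal sketch: extract (subject, verb, object) "behavior" phrases from a sentence.
# Requires: pip install spacy && python -m spacy download en_core_web_sm
import spacy

nlp = spacy.load("en_core_web_sm")

def extract_behaviors(sentence):
    """Return (subject, verb, object) triples; subject or object may be missing."""
    behaviors = []
    for token in nlp(sentence):
        if token.pos_ != "VERB":
            continue
        subj = next((c.lemma_ for c in token.children
                     if c.dep_ in ("nsubj", "nsubjpass")), None)
        obj = next((c.lemma_ for c in token.children
                    if c.dep_ in ("dobj", "obj")), None)
        if subj or obj:
            behaviors.append((subj, token.lemma_, obj))
    return behaviors

print(extract_behaviors(
    "This malware is designed to send SMS messages to certain premium numbers."))
# e.g. [('malware', 'design', None), (None, 'send', 'message')]
```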
Now, we extract a lot of these behaviors, but we still don't know what they mean. To figure out what they mean, we first try to link them to concrete features that can be extracted directly from a malware sample, for example through static analysis: things like API calls, permissions, or intents. We create a link between a behavior, such as accessing sensitive data, and a feature if we find them close by in the text, in a sentence. The intuition behind this goes back to research in cognitive psychology showing that when humans describe something, they tend to mention semantically similar terms first and then increasingly less relevant concepts; so proximity in the sentence indicates that there is a semantic connection between these things. Here we specifically look for concrete features near behaviors, and we consider that the connection means the feature expresses that behavior. We then do the same thing for linking behaviors to actual malware: if we find a behavior close by in the text to a known malware family name, or to the word "malware" or its synonyms, then again we create a link between them.

These nodes and links define what we call a semantic network. Semantic networks are a more general concept in natural language processing, but in this work we have a semantic network with just three types of nodes: malware families and concrete features, which are named entities, and behaviors, which are open ended; the behaviors link the malware families to the concrete features. Another thing we do is derive a weight for each edge based on the distance and the co-occurrence frequency of the two nodes; this allows us to infer how close the semantic connection between the concepts is. This image shows a fragment of our semantic network, with the malware nodes on the left, the behavior nodes in the middle, and the feature nodes on the right. After creating these edges, we start from the malware nodes and propagate the weights along the edges, all the way to the features. Note that some of the features end up not connected to any malware node; those are probably less relevant to malware behaviors and less useful for detecting malware, while the ones that are connected through a behavior node are probably relevant for malware detection. At the end we wind up with weights on the feature nodes, which allows us to rank them and determine which features are most relevant for malware detection according to the literature we analyzed.

So this is what our system looks like; we call it FeatureSmith. The inputs are the scientific literature, the Android developer documentation, and a list of malware family names; at this point we don't use any actual samples, we just need the names. From the list of names we create the malware nodes, from the Android docs we extract the features, and from the scientific literature we extract the behaviors. Then, also by mining the scientific literature, we construct the links between the three types of nodes, derive the edge weights, and do the weight propagation.

There is another cool thing we can do at the end. All of this is feature engineering, so in the end we have a list of features we can use for training a machine learning detector. But we can also traverse the network backwards, starting from the features and going to the behaviors, and this allows us to generate explanations. Explaining machine learning outputs is one of the big challenges in machine learning; a lot of models are not explainable. What we can do here, and I'll show an example later on, is that if a detection is due to certain features of the sample, we can explain in human terms what behavior those features correspond to, and we can also link to the papers that claimed there is a connection between the feature and the behavior. So this is how the system works in a nutshell.
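Here is a minimal sketch of the kind of semantic network and weight propagation just described, using networkx. The node names and edge weights are invented for illustration; in the real system the weights come from distance and co-occurrence statistics in the text.

```python
# Minimal sketch: malware -> behavior -> feature network with weight propagation.
import networkx as nx

G = nx.DiGraph()
# malware family -> behavior (weights are made up for illustration)
G.add_edge("GingerMaster", "gain root access", weight=0.8)
G.add_edge("FakePlayer", "send SMS message", weight=0.9)
# behavior -> concrete feature
G.add_edge("gain root access", "Runtime.exec", weight=0.7)
G.add_edge("send SMS message", "sendTextMessage", weight=0.9)
G.add_edge("send SMS message", "SEND_SMS permission", weight=0.6)

def rank_features(graph, malware_nodes):
    """Push weight from malware nodes through behaviors down to feature nodes."""
    scores = {}
    for m in malware_nodes:
        for behavior in graph.successors(m):
            w1 = graph[m][behavior]["weight"]
            for feature in graph.successors(behavior):
                w2 = graph[behavior][feature]["weight"]
                scores[feature] = scores.get(feature, 0.0) + w1 * w2
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

print(rank_features(G, ["GingerMaster", "FakePlayer"]))
# Features not reachable from any malware node simply receive no score.
```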
The question is: how well does this work? With FeatureSmith we analyzed about one thousand security papers, and we automatically engineered 195 features that the technique considers relevant to Android malware; that is roughly half of the features it found in the papers, and it considered the rest not relevant to malware. We are going to compare this against Drebin, which is a state-of-the-art Android malware detector. Drebin uses a huge feature set, over 500,000 features, which includes a list of 315 API calls that the authors consider malicious. This list was manually curated; that was the feature-engineering effort in that project: out of the more than 20,000 API calls in Android, they decided these 315 are the ones to look at.

What we actually want to compare is not the two systems directly, but the effectiveness of the feature sets: the features we engineer versus the 500,000 Drebin features. So we compare the feature sets while using the same classification algorithm (in this case we train random forests), the same corpus of benign and malicious apps (here we do use the samples, but only for evaluation; the feature engineering is done only from the literature, from natural language), and the same feature types (permissions, API calls, and Android intents). Then we compare the performance of the classifiers trained with the two different feature sets; it is an apples-to-apples comparison.

The first observation is qualitative: even though Drebin has this huge feature set, we still discover some new features. These three features are missing from the manually engineered Drebin feature set, but it turns out they are often used by malware. In particular, there is a malware family called Gappusin, and the Drebin paper says it is something they cannot detect, because it basically acts like a downloader, so it doesn't look malicious; it turns out that family actually uses one of these features, and with it we can detect it. The point, as I said a little earlier, is that it is really difficult for a human data scientist to assimilate all the knowledge from those one thousand papers; if you engineer a feature set manually, you are probably going to miss some important features.

Now let's look at the actual detection performance comparison. Both systems have pretty good detection, so I am going to zoom in. This is a ROC plot, a receiver operating characteristic curve; it shows the trade-off between the false positive rate on the x-axis and the true positive rate on the y-axis. There is always a trade-off: you can always reduce one of them by increasing the other. So you typically get curves like these, and if one curve is above another, that detector is better. The ideal point is this one, with zero false positives and one hundred percent true positives, and if your detector is on the diagonal, it is no better than random guessing.
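Here is a minimal sketch of that apples-to-apples setup: the same random-forest classifier trained on the same samples with two feature subsets of different sizes, compared on the ROC metrics mentioned above. The data here is synthetic; in the actual evaluation the columns would be binary indicators for permissions, API calls, and intents.

```python
# Minimal sketch: same classifier, same samples, two feature subsets.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score, roc_curve
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, n_features=600, n_informative=40, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

feature_sets = {"all 600 synthetic columns": slice(None),
                "first 195 synthetic columns": slice(0, 195)}
for name, cols in feature_sets.items():
    clf = RandomForestClassifier(n_estimators=200, random_state=0)
    clf.fit(X_tr[:, cols], y_tr)
    scores = clf.predict_proba(X_te[:, cols])[:, 1]
    fpr, tpr, _ = roc_curve(y_te, scores)
    # true-positive rate at roughly 1% false positives, as reported in the talk
    tpr_at_1pct = tpr[np.searchsorted(fpr, 0.01)]
    print(f"{name}: AUC={roc_auc_score(y_te, scores):.3f}, TPR@1%FPR={tpr_at_1pct:.3f}")
```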
So this is Drebin; like I said, it is pretty good, which is why I am zooming in on the upper-left corner. And this is FeatureSmith; it looks almost the same, almost the same performance. In fact, in security papers we tend to report what happens at a one percent false positive rate, and there the two have exactly the same performance: 92.5% true positives. That is despite the fact that Drebin uses its huge feature set, while the FeatureSmith classifier uses only 195 features.

Another thing we thought might be interesting to investigate is how this knowledge evolves over time, so we trained classifiers using features extracted only from papers published before 2012, before 2013, 2014, and 2015. The performance can only increase, because it is cumulative: each step takes into account the previous papers. It is interesting to see that there is a big jump in effectiveness in 2013, and then we start seeing diminishing returns. When I show this, people ask me whether it implies that researchers are publishing useless work; it doesn't quite say that, and here is why. This is the detection performance for a fixed problem: we used the same set of malware samples, which I believe were collected up to 2012, so this is the detection performance for 2012 malware. For a fixed problem, it is not that surprising that over time you learn everything there is to know about it and you get diminishing returns. It may be that the later papers looked at newer malware behaviors, that they were more focused on the types of malware that appeared after 2012; I think that would be an interesting hypothesis, but we don't have a corpus to test it.

I also told you that FeatureSmith outputs a ranking of the features according to how relevant they are to malware detection, based on the literature it analyzed. We wanted to see how good this ranking is, so we compared it with a term-frequency (TF) ranking; this is a common metric used in information retrieval, and it basically looks at how often each feature is mentioned in the research papers. This plot shows the cumulative mutual information: we add the mutual information of each feature on top of the previous ones, and you can see that we always get more mutual information from the FeatureSmith ranking than from the term-frequency ranking. We also compared both rankings with the actual ranking based on mutual information, that is, how useful the features were for separating malware and benign samples in our ground truth: how much each feature reduces the uncertainty about the class of the sample. Then we ran a rank-correlation test.
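Here is a minimal sketch of that kind of comparison: mutual information between each feature and the benign/malicious label as the ground-truth utility, and a Spearman rank-correlation test against a literature-derived score. Everything here is synthetic stand-in data, not our actual features or rankings.

```python
# Minimal sketch: feature utility (mutual information) vs. a literature-derived ranking.
import numpy as np
from scipy.stats import spearmanr
from sklearn.feature_selection import mutual_info_classif

rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(1000, 20))      # binary feature matrix (apps x features)
y = (X[:, 0] | X[:, 3]).astype(int)          # synthetic benign/malicious labels

# How much does each feature reduce uncertainty about the label?
mi = mutual_info_classif(X, y, discrete_features=True, random_state=0)

# Stand-in for literature-derived feature weights (e.g., FeatureSmith scores).
literature_scores = mi + rng.normal(0.0, 0.05, size=mi.shape)

rho, p_value = spearmanr(mi, literature_scores)   # rank-correlation test
print(f"Spearman rho = {rho:.2f}, p = {p_value:.3g}")
```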
For the TF ranking, we found no statistically significant correlation; but for the FeatureSmith ranking, we did find a statistically significant correlation between the ranking based on the literature and the ranking we observed on our malware dataset.

We also have some false positives, but after we looked into them, it turned out that many of them actually have at least one detection on VirusTotal, so they may not be that benign after all; and some of them are true false positives, but those are security apps or parental-supervision apps, which do a lot of the same things malware does.

I also told you that FeatureSmith can help with generating explanations for machine learning outputs. Typically, work on explainable machine learning focuses on identifying which features contributed to each prediction. But sometimes these features are still a bit too ambiguous: they don't indicate the semantics behind the prediction. Let's say a sample was detected because it invokes the function getNetworkOperatorName; what does that mean? An analyst seeing this may want to see more information. What we can do is take our semantic network and traverse it backwards, starting from the feature and going back to the behavior nodes: this API call is linked to a behavior node that says "send the network operator name to a malicious server", and we can also present snippets from the papers, the actual sentences from which we extracted this connection, along with a reference to the paper itself. Presented with this analysis, an analyst can make sense of it: this is suspicious because it reads the network operator name and may send it to a malicious server. Basically, it presents the semantics behind the concrete feature the detector used.

Before I conclude, I want to mention the alternatives to feature engineering. First, we could do feature selection. This works in situations where you can enumerate all the possible features in advance; say you have the list of all the Android permissions, and then you compute the mutual information, or some other measure of feature utility, for each of them. The problem is that this works for the common attack patterns but ignores the uncommon ones. Representation learning is a way to discover useful features from raw data; this is what neural networks do, so again it avoids feature engineering; but the challenge is that you won't get the complex, domain-specific features I mentioned at the beginning of my talk. A general disadvantage of both approaches is that they are data driven: they tell you how useful the features are for detecting malware on a specific ground truth, a specific dataset, and they may reflect the biases in that ground truth. If the dataset is old and does not contain new behaviors, the features that represent those new malware behaviors will not be selected. And neither approach discovers the threat semantics automatically. [Question from the audience.] Yes, that is a fair point; that is the same kind of disadvantage, and our corpus is a little bit old.
So for follow-on work we are looking at blogs, and not just blogs: if you think about it, the papers are the voice of the defenders, so we are also looking at underground forums and Pastebin, things actually used by attackers; that is the voice of the attackers, and it is interesting to see what they describe as their problems and how they solve those problems in bypassing existing security defenses. [Question from the audience about features for which no behavior is found.] Right, that is true: if we do not find good behaviors, then we might not be able to identify the meaning of those features. A more interesting question is why this works at all, because there is a limited number of features you can put in a paper or a blog post; even the Drebin paper did not list all five hundred thousand features they have. But it turns out that when authors highlight features, they tend to start from the best ones, the ones they can explain, and apparently that is enough. This was not at all obvious when we started the project.

Today many people are worried that automation and artificial intelligence will end up killing lots of jobs, so I just want to say that I don't think this technique will put data scientists out of a job. There is a labor deficit projected, but more importantly, this does not do the same thing a data scientist does; it complements what data scientists do, it does not replace them. Specifically, human data scientists have intuition, which is not something we have figured out how to simulate, while FeatureSmith can reason over a huge body of knowledge; so data scientists can use FeatureSmith as a tool for discovering useful features.

More importantly, I think this illustrates the broader promise of techniques for automating the discovery of threat semantics. In the few minutes I have left, I want to give you a vision of what automatic discovery could allow us to do, and I am going to mention four potential directions.

One of them is explaining the outputs of statistical detectors. In general, when we use machine learning in security, many of these detectors are black boxes: they say "I detected that this is malware," but they don't explain why, and there is work showing that this has actually slowed down the adoption of machine learning in the security industry, because security analysts are uncomfortable adopting something when they don't know how it works; you don't know whether the detection was based on some meaningful feature or on some artifact of the dataset. FeatureSmith allows us to link the features to the behaviors from the papers, which brings some semantic meaning back to the features. In the example I gave you, if a malware sample is detected because it uses getNetworkOperatorName, you get a snippet of text that explains how malware likes to use this API call. So the question here is: can we use this style of explanation to bridge the semantic gap between statistical inferences and the mental models of security analysts, by providing explanations in a colloquial language, the way an analyst would describe why this feature was used in the detection?
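As a rough sketch of what such an explanation generator might look like, here is a traversal of the semantic network backwards, from a concrete feature to the behaviors and supporting paper snippets linked to it. The behavior name, snippet text, and source name are invented for illustration, not taken from our actual network.

```python
# Minimal sketch: generate human-readable explanations from the semantic network.
import networkx as nx

G = nx.DiGraph()
G.add_edge("send network operator name to malicious server", "getNetworkOperatorName",
           weight=0.7,
           snippet="... the Trojan sends the network operator name to a remote server ...",
           source="example-paper-2013.pdf")  # hypothetical supporting paper

def explain(graph, feature):
    """Return explanation lines for why a feature is considered suspicious."""
    lines = []
    for behavior in graph.predecessors(feature):
        edge = graph[behavior][feature]
        lines.append(f"{feature} is linked to the behavior '{behavior}' "
                     f"(weight {edge['weight']:.2f}); supporting text: "
                     f"\"{edge['snippet']}\" [{edge['source']}]")
    return lines

for line in explain(G, "getNetworkOperatorName"):
    print(line)
```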
So that is one potential implication. Another is discovering new threats. When I present this work, somebody invariably asks: what you do is just mine existing, known features, right? These are already in the papers, so why is this new? They are already known. But the semantic network basically helps us connect the dots between concepts that sometimes appear in different documents. In particular, the new features I told you about, the ones missing from the Drebin feature set, actually came from papers that talked about privacy, so they were not necessarily papers that the malware-detection community would be interested in; that knowledge may simply be ignored because of that. More importantly, the links between different concepts may themselves be important. This came up especially when we tried to expand our work to mining sites popular with hackers, trying to understand the semantics of these malicious activities.

For example, suppose you want to characterize malware campaigns. Nowadays there are a lot of threat intelligence companies that provide feeds of indicators of compromise, IOCs, corresponding to different threat actors, to the Russians or the Chinese or whoever. But these IOCs are just indicators: things like IP addresses, hashes, or domain names, and you don't know exactly what they are; in particular, you don't know what role they play in the campaign. A domain name, for example, can be the site that performs phishing, or a command-and-control node, or the site that hosts an exploit kit, and these are very different stages of an attack. By mining natural-language text to figure out the role of an indicator in a campaign, we can say that the same campaign used this domain and then a different domain, so the two domains are probably behind the same attack pattern. The same goes for exploit weaponization: the exploits seen in the wild are different from the proof-of-concept exploits included in, for example, Exploit-DB, because attackers have to add things like platform fingerprinting, improve the robustness of the exploits, and add more malicious payloads. Understanding the purpose of all these code changes can give us a glimpse into the tasks the attackers are trying to achieve, the challenges they face, and how they solved them.

Prediction is another interesting direction. It is somewhat counter-intuitive: how can you predict security events, when in security we are dealing with an intelligent adversary who will do exactly the opposite of what we want him to do? But it turns out that today malicious code has a lot of specialized components, and coming up with weaponized malware code, with an exploit, is usually beyond the skills of a single actor.
So typically these are developed in a collaborative fashion; there is a lot of discussion on these forums about how to weaponize exploits, for example, and we can analyze this discourse to forecast attacks. I briefly mentioned our work on predicting which vulnerabilities are going to be exploited by mining Twitter; in that case we had a two-day median lead time of detection compared to the creation of signatures for detecting those exploits. An interesting insight there is that we actually need both classes of features: the features we extract from Twitter give us high precision and low recall, while the features we extract from the vulnerability characteristics themselves give us high recall and low precision, so we need both kinds of features to make a good prediction.

Another potential direction is generating exploits and cyber deceptions automatically. So far I have only talked about understanding and predicting things, but recently there have been a lot of advances in automatic vulnerability discovery and exploit generation, for example in the Cyber Genome project. These techniques are very powerful, but the existing ones focus on memory-corruption exploits. So the question here is: if we mine the underground discourse and understand what the challenges are for the attackers and how they solve those challenges, can we generate exploits for a broader class of vulnerabilities? Or can we generate effective cyber deceptions, for example by making it look like a system is vulnerable to a specific attack, a specific exploit, when it actually is not, with the purpose of thwarting platform fingerprinting?

So I am going to wrap up now. I told you about our semantic network design, which is a flexible representation of security knowledge, and our system can discover open-ended malware behaviors. I also described automatic feature engineering, a method for discovering semantically meaningful features, some of which were missing from a manually curated state-of-the-art feature set; and the performance of the automatically engineered feature set is comparable to that of a state-of-the-art detector. We have released our semantic network and all the features we engineered on our website; you can check it out, and we hope that other people will build on our work. Before I conclude, I just want to leave you with two thoughts. The first is that automated systems can understand the semantics of security concepts. The second is that this is a powerful tool for creating both attacks and defenses. Thank you for your attention.

[Question from the audience.] Yes, we experimented with groups of papers published before 2012, before 2013, and so on. I don't know the answer to that particular question offhand; we can look into it. [Question from the audience.] Right. So in our automatically engineered feature set we had features that did not show up in the malware ground truth at all, and we were curious, so we looked into what these things are. One of them was isMusicActive, an API call from Android, and we wondered why this would be useful for malware. It turns out there is a paper showing that it can actually be used as a side channel to leak location information if you are driving while listening to music.
Google Maps will interrupt the music whenever it tells you to turn left or right, and it turns out that just knowing the timing of these interruptions, if you have enough of them, is probably enough to locate somebody. No malware uses this in the wild yet, but that doesn't mean they can't. This goes back to the data-driven disadvantage I mentioned: with our approach you can actually find features that are not present in the data, features that are useful according to what the security community thinks, not according to one particular dataset.

[Question from the audience about how we built the paper corpus.] We built it ourselves. We started with some conference proceedings, as many as we could find, and ultimately we signed an agreement with the IEEE, and they give us a feed of all the papers published at IEEE venues, on a weekly basis, so that we can mine them. Since you brought this up, I also want to mention that this kind of literature-based discovery is established in the biomedical field, and the benefit there is that they can usually get away with analyzing just the abstracts, from PubMed for example, because in that field people publish structured abstracts with a very strict structure. In security, and in computer science in general, we don't tend to publish structured abstracts, so we actually need to parse the entire paper, and there are some complicated issues related to parsing PDFs, such as pulling information out of tables; we had to take care of those issues as well.

[Question from the audience about applying this to other domains.] It is a good idea; we can probably apply this to many different problems, many different types of literature. One suggestion I heard is mining protocol specifications, RFCs for example, and why not mine legal documents, court documents? Sure, I think it is an interesting direction to pursue.

[Question from the audience about weakly connected features.] Right: if there is no connection at all, then we say the feature is not related; if there is a connection, but it appears in only one paper, then we would say it is connected, but it would be lower in the ranking we end up with.

[Question from the audience.] That is a very interesting question. The work I described here is a one-shot thing: we analyzed this corpus of papers and we published our data, so everybody can get it. In our follow-on work on mining underground sites and identifying the roles of, for example, the IOCs in a campaign, we are actually looking at setting up a web site that is updated periodically; I think it is actually live, I don't have the link here, but we have already set it up. That work is still a bit ongoing: we can identify the roles pretty well for a specific type of campaign, a malware distribution campaign, but we want to see if we can generalize this to other types of threats and campaigns.
I think the work we are doing on exploit weaponization is also relevant here. The way we got to it is the observation that sometimes someone will tweet a short code snippet that later ends up in an exploit. This was the case, for example, with the exploit used in the 2011 attack against RSA, which reportedly resulted in stealing seeds for the SecurID tokens, so a lot of tokens had to be replaced at the time. It turned out that a very specific code snippet was tweeted maybe a month or a few weeks before the attack, before the spear-phishing email was sent to RSA. Brian Krebs has a very nice article about who the person who tweeted it was, and why. So we believe these sorts of things are out there, and this is what we are trying to do in the weaponization project: figure out the purpose of these communications, of the code snippets the attackers exchange and modify, and why they are modifying them. That is ongoing work.

[Question from the audience.] I said that the work in this particular paper was a one-off analysis, but that is not how I approach my work in general. I used to work for Symantec, and my job at Symantec was to build the WINE platform, which shared data with researchers in academia; it was basically one of the earliest threat intelligence platforms. The point there was that it was an ongoing process: we obviously didn't just dump a one-time dataset, it was something people could use on an ongoing basis, and it actually had a pretty nice impact in terms of the papers that were published based on that data. So that was my experience before going back to academia, and as an academic I actually strive to build things that last.