Thank you for this very nice introduction. This is the first time I am at Georgia Tech, and I think also the first time in Atlanta, so I'm very excited to be here. I'm going to tell you about some recent work we did last year. Basically, this talk is about how to build a system that learns to detect malware by reading and understanding security papers.

A lot of security problems involve distinguishing between malicious and benign entities, instances, or artifacts, and we use machine learning a lot for this problem. Professor Wenke Lee did some of the seminal work in this area more than fifteen years ago, and now machine learning is widely used in the security industry; I know because I actually worked there. Basically, machine learning learns from examples: it starts with a few known benign and known malicious examples (can you see the slides from that side of the room? OK, cool), and then it tries to classify the rest of the examples according to how similar they are to these initial instances. They don't have to be identical; the point is to detect new things that are somewhat similar, but not exactly the same as before. This general approach is used for detecting spam, malware, and network attacks, for predicting which vulnerabilities will be exploited in the wild, which web sites will become malicious in the near future, for predicting data breaches, and probably many other problems. In all of these problems, the fundamental question is: what does it mean for two samples to be similar?

Let me explain this more clearly with a few examples, which illustrate how important it is to determine similarity in the right way. The first two examples are from my own work; before this project we did a lot of machine learning projects for security. In one of them we tried to see if we could predict which vulnerabilities are going to be exploited in the wild by mining the Twitter stream. There are lots of efforts to predict various events using Twitter: stock market movements, movie revenues, flu trends, and so on. Typically, Twitter analytics involve analyzing the content of the tweets themselves as well as some features of the users, such as how popular or influential they are.
These features that come from Twitter itself turn out not to work well enough for predicting vulnerability exploitation; we also need to look at the characteristics of the vulnerability itself: for example, whether it is a remote code execution vulnerability, a privilege escalation vulnerability, or a denial-of-service vulnerability. This matters because attackers are more interested in a code execution vulnerability than in a denial-of-service vulnerability, for example. The CVSS score, a numeric measure of how severe the vulnerability is, also turns out to be important. So the key here is that in order to do this prediction you need some domain-specific features, the features of the vulnerability itself, that come from outside of Twitter. I put a link here; I don't plan on talking more about this paper, but if you're interested you can check the link or ask me about it later.

That's the first example. In the second example, we tried to detect malware delivery on the client side. What this means is that we look at what files are being downloaded on a host and, in particular, who downloads them. In many attacks, a piece of malware downloads additional payloads from the Internet, perhaps because that is the next stage of the exploit, or in some cases these are just generic droppers that distribute malware; that is their business, they just distribute malware. Now, if you look only at the contents of these droppers, or at their behavior, they are difficult to distinguish from, say, a software updater, which does the same thing: it goes to the Internet, downloads something, and executes it. But if you look at the relationship "file A downloads file B", you can reconstruct from these relationships the download graph on each host, and it turns out that the properties of these graphs look very different for benign download activity and for malicious download activity. So again, the trick to solving this problem was to figure out that we need to look at complex, graph-based features derived from the structure and evolution of these graphs. I'm not going to talk more about this work either; there is a link to the paper, and if you're interested, ask me after the talk and I can provide more details.

And then the third example, which I'll use as a running example in this talk, is Android malware detection. Again, the question here is how we should compare samples: what does it mean for two Android samples to be similar? The earliest Android malware detectors just looked at the permissions, and initially this worked fairly well, because Android malware needed to request certain permissions that were essential for its functionality. But these early detectors became less and less effective as malware evolved, and if you think about it, the permissions just indicate the privileges of an application, not its actual behavior. Also, if the malware performs a privilege escalation exploit, it doesn't actually need to request permissions.
So the second generation of Android malware detectors used API method calls as features for comparing samples. I gave you these three examples, but the point I'm trying to make is that in order to detect malicious activity, malware, malicious instances, you need to think very carefully about the semantics of the threat you're trying to detect, and about whether the features the detector is going to use are actually related to those semantics; that is the point of the example with the malicious behavior versus the privileges of the application.

This step is called feature engineering, and in our experience it is the most time-consuming step of any machine learning project: coming up with good features. So how do we do this in security? Well, we read papers and industry reports. There is a very large volume of information published about security; in fact, the volume is so large that it is hard to keep up with it. If you search Google Scholar for papers published on malware, you get over one hundred thousand hits; for intrusion detection, over six hundred thousand. Who is going to read all these papers? The challenge is that it becomes difficult to assimilate all the relevant knowledge that might be important for the machine learning task you're trying to implement.

So here is the dilemma: on the one hand, we have a lot of people working in security and a lot of good information about attacker behavior, about attacks and how to detect them; on the other hand, this growing body of knowledge makes it difficult to engineer good features. What we asked in this project is: can we turn this growing body of knowledge to our advantage? Can we engineer features automatically by mining all this literature? In other words, can we create an artificial intelligence that not only learns from examples, but can also help us build other intelligent systems?

To do that, we must first understand how security threats are described in these papers, so let's take a look at a few examples. This sentence says that the malware is designed to send SMS messages to certain premium numbers. A security analyst reading this sentence would conclude that this malware commits SMS fraud; that is the malicious activity. But note that this conclusion is based on common sense; it is not based on any linguistic clues that appear in the sentence. Let's look at another example. This one says that intents, such as BATTERY_CHANGED_ACTION, are used as triggers to start the malicious service in the background. Here the sentence tells you what the malware is doing, but you need some understanding of Android programming to know that intents are a mechanism that allows you to register callbacks, to start an activity when a specific event occurs. So again, really understanding what the sentence means requires some background security knowledge. And this third example says that GingerMaster is often bundled with benign applications and tries to gain root access. [Question from the audience about where these sentences appear.] This one is from a research paper published at a security conference, yes.
In this talk I'm only going to talk about analyzing research papers, so we get the PDFs and extract the text. [Question from the audience about the source of this sentence.] Yes, for this one the citation is here; this is the Drebin paper. The point is that, again, to understand the meaning of the sentence you need to know that bundling is done to make the application look like a benign application, and that gaining root access is basically a form of privilege escalation.

Understanding the semantic meaning of the sentences used to describe security threats is our first challenge. This is an instance of a broader challenge in the field of natural language processing called common-sense reasoning: we need to interpret the sentences and extract their meaning by using common sense and knowledge of the security domain. This is a pretty broad, generic challenge, one of the biggest challenges in language processing. But here we have another challenge that is more specific to the security domain, and I'm going to illustrate it with this plot, which shows the cumulative count of unique words in papers published over the years at the IEEE Security and Privacy Symposium, also known as Oakland, the flagship security research conference. How do you think this should look? I'm curious. Yes, it goes up; it's cumulative, so it has to increase. Exponential? OK. Are there any natural language processing or machine learning people in the room; what would you expect it to look like? It can't really be exponential, because you have a comparable number of papers every year, so how many new words can you introduce?

Well, it turns out the growth rate is actually constant. Basically, this means that every year a constant number of new terms are introduced in these papers, terms that we somehow need to interpret. The reason behind this is the security arms race: we come up with defenses, attackers invent new attacks and new ways to bypass the defenses, and this leads to new terms being used to describe these new attacks and these new concepts, which are important. This has an important implication for analyzing this text: we cannot rely on natural language processing techniques that try to match the text against a fixed set of concepts, a fixed set of terms. Instead, we must somehow find a way to discover open-ended malware behaviors from the language, from the text.
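As a rough illustration of the measurement behind that vocabulary plot, here is a minimal sketch of how one could compute the cumulative unique-term count per publication year. The corpus and its contents below are invented for illustration; this is not the actual processing pipeline used in the work.

```python
# Minimal sketch: cumulative vocabulary growth over publication years.
# `papers_by_year` is a hypothetical mapping from year to lists of paper text.
import re

def cumulative_vocabulary(papers_by_year):
    seen = set()
    growth = {}
    for year in sorted(papers_by_year):
        for text in papers_by_year[year]:
            seen.update(re.findall(r"[a-z]+", text.lower()))
        growth[year] = len(seen)  # unique terms seen up to and including this year
    return growth

papers_by_year = {
    2010: ["the malware requests the SEND_SMS permission to commit premium fraud"],
    2011: ["the rootkit hides its kernel module from the process list"],
    2012: ["the ransomware encrypts user files and demands payment"],
}
print(cumulative_vocabulary(papers_by_year))
# A constant year-over-year difference corresponds to the linear growth seen in the plot.
```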
Let me give you an intuition for how this might work. Like I said, we mine research papers that typically propose features for detecting malicious Android apps. These papers usually have a section that explains their feature engineering, and they say things like: we use the getDeviceId and getSubscriberId API calls as features because they allow an app to access sensitive data; we use these other two calls because they allow the app to communicate over the network; and we use Runtime.exec because it allows an app to execute external commands. Note that these are malware behaviors, more abstract malware behaviors, described using the words an analyst would use; they are described in human language. So it seems that in order to understand what these features actually do, what their meaning is, and whether they are useful for malware detection, we need to discover these malware behaviors and link them to the statements about the features.

This is what we try to do. The first thing our system does is behavior extraction. A behavior here means a brief description of malware activity, such as "access sensitive data" or "execute external command". For the purposes of this work we define it as a short phrase consisting of a subject, a verb, and an object, where either the subject or the object may be missing. We detect these patterns in natural language by analyzing the grammatical structure of sentences; specifically, we use a type of dependency parser. Let me show you how this works. Let's go back to the sentence about the malware that sends SMS messages to premium numbers. The first thing we do is part-of-speech tagging, which means we determine which word is a verb, which is a noun, which is an adjective. Then we establish the dependencies: directed links between two words that indicate a grammatical relationship between them. For example, the red edges here show the direct-object relationship between the verb and its object. We have a number of these typed dependencies, and from them we extract the subject-verb-object patterns we are looking for; from this sentence we extract five behaviors. Basically, this allows us to break down a long sentence into shorter statements that each have a single meaning; these are the behaviors we are looking for.
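To make this extraction step concrete, here is a minimal sketch of subject-verb-object extraction using spaCy's part-of-speech tagger and dependency parser. This only approximates the idea; it is not the parser or the dependency patterns used in FeatureSmith.

```python
# Minimal sketch: extract (subject, verb, object) "behavior" phrases from a sentence.
# Requires: pip install spacy && python -m spacy download en_core_web_sm
import spacy

nlp = spacy.load("en_core_web_sm")

def extract_behaviors(sentence):
    """Return (subject, verb, object) triples; subject or object may be missing."""
    behaviors = []
    for token in nlp(sentence):
        if token.pos_ != "VERB":
            continue
        subj = next((c.lemma_ for c in token.children
                     if c.dep_ in ("nsubj", "nsubjpass")), None)
        obj = next((c.lemma_ for c in token.children
                    if c.dep_ in ("dobj", "obj")), None)
        if subj or obj:
            behaviors.append((subj, token.lemma_, obj))
    return behaviors

print(extract_behaviors(
    "This malware is designed to send SMS messages to certain premium numbers."))
# e.g. [('malware', 'design', None), (None, 'send', 'message')]
```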
Now, we extract a lot of these behaviors, but we still don't know what they mean. To figure out what they mean, we first try to link them to concrete features that can be extracted directly from a malware sample, for example through static analysis: things like API calls, permissions, or intents. We create a link between a behavior, such as accessing sensitive data, and a feature if we find them close by in the text, in a sentence. The intuition behind this goes back to research in cognitive psychology showing that when humans describe something, they tend to mention semantically similar terms first and then increasingly less relevant concepts; so proximity in the sentence indicates that there is a semantic connection between these things. Here we specifically look for concrete features near behaviors, and we consider that the connection means the feature expresses that behavior. We then do the same thing for linking behaviors to actual malware: if we find a behavior close by in the text to a known malware family name, or to the word "malware" or its synonyms, then again we create a link between them.

These nodes and links define what we call a semantic network. Semantic networks are a more general concept in natural language processing, but in this work we have a semantic network with just three types of nodes: malware families and concrete features, which are named entities, and behaviors, which are open ended; the behaviors link the malware families to the concrete features. Another thing we do is derive a weight for each edge based on the distance and the co-occurrence frequency of the two nodes; this allows us to infer how close the semantic connection between the concepts is. This image shows a fragment of our semantic network, with the malware nodes on the left, the behavior nodes in the middle, and the feature nodes on the right. After creating these edges, we start from the malware nodes and propagate the weights along the edges, all the way to the features. Note that some of the features end up not connected to any malware node; those are probably less relevant to malware behaviors and less useful for detecting malware, while the ones that are connected through a behavior node are probably relevant for malware detection. At the end we wind up with weights on the feature nodes, which allows us to rank them and determine which features are most relevant for malware detection according to the literature we analyzed.

So this is what our system looks like; we call it FeatureSmith. The inputs are the scientific literature, the Android developer documentation, and a list of malware family names; at this point we don't use any actual samples, we just need the names. From the list of names we create the malware nodes, from the Android docs we extract the features, and from the scientific literature we extract the behaviors. Then, also by mining the scientific literature, we construct the links between the three types of nodes, derive the edge weights, and do the weight propagation.

There is another cool thing we can do at the end. All of this is feature engineering, so in the end we have a list of features we can use for training a machine learning detector. But we can also traverse the network backwards, starting from the features and going to the behaviors, and this allows us to generate explanations. Explaining machine learning outputs is one of the big challenges in machine learning; a lot of models are not explainable. What we can do here, and I'll show an example later on, is that if a detection is due to certain features of the sample, we can explain in human terms what behavior those features correspond to, and we can also link to the papers that claimed there is a connection between the feature and the behavior. So this is how the system works in a nutshell.
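Here is a minimal sketch of the kind of semantic network and weight propagation just described, using networkx. The node names and edge weights are invented for illustration; in the real system the weights come from distance and co-occurrence statistics in the text.

```python
# Minimal sketch: malware -> behavior -> feature network with weight propagation.
import networkx as nx

G = nx.DiGraph()
# malware family -> behavior (weights are made up for illustration)
G.add_edge("GingerMaster", "gain root access", weight=0.8)
G.add_edge("FakePlayer", "send SMS message", weight=0.9)
# behavior -> concrete feature
G.add_edge("gain root access", "Runtime.exec", weight=0.7)
G.add_edge("send SMS message", "sendTextMessage", weight=0.9)
G.add_edge("send SMS message", "SEND_SMS permission", weight=0.6)

def rank_features(graph, malware_nodes):
    """Push weight from malware nodes through behaviors down to feature nodes."""
    scores = {}
    for m in malware_nodes:
        for behavior in graph.successors(m):
            w1 = graph[m][behavior]["weight"]
            for feature in graph.successors(behavior):
                w2 = graph[behavior][feature]["weight"]
                scores[feature] = scores.get(feature, 0.0) + w1 * w2
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

print(rank_features(G, ["GingerMaster", "FakePlayer"]))
# Features not reachable from any malware node simply receive no score.
```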
The question is: how well does this work? With FeatureSmith we analyzed about one thousand security papers, and we automatically engineered 195 features that the technique considers relevant to Android malware; that is roughly half of the features it found in the papers, and it considered the rest not relevant to malware. We are going to compare this against Drebin, which is a state-of-the-art Android malware detector. Drebin uses a huge feature set, over 500,000 features, which includes a list of 315 API calls that the authors consider malicious. This list was manually curated; that was the feature-engineering effort in that project: out of the more than 20,000 API calls in Android, they decided these 315 are the ones to look at.

What we actually want to compare is not the two systems directly, but the effectiveness of the feature sets: the features we engineer versus the 500,000 Drebin features. So we compare the feature sets while using the same classification algorithm (in this case we train random forests), the same corpus of benign and malicious apps (here we do use the samples, but only for evaluation; the feature engineering is done only from the literature, from natural language), and the same feature types (permissions, API calls, and Android intents). Then we compare the performance of the classifiers trained with the two different feature sets; it is an apples-to-apples comparison.

The first observation is qualitative: even though Drebin has this huge feature set, we still discover some new features. These three features are missing from the manually engineered Drebin feature set, but it turns out they are often used by malware. In particular, there is a malware family called Gappusin, and the Drebin paper says it is something they cannot detect, because it basically acts like a downloader, so it doesn't look malicious; it turns out that family actually uses one of these features, and with it we can detect it. The point, as I said a little earlier, is that it is really difficult for a human data scientist to assimilate all the knowledge from those one thousand papers; if you engineer a feature set manually, you are probably going to miss some important features.

Now let's look at the actual detection performance comparison. Both systems have pretty good detection, so I am going to zoom in. This is a ROC plot, a receiver operating characteristic curve; it shows the trade-off between the false positive rate on the x-axis and the true positive rate on the y-axis. There is always a trade-off: you can always reduce one of them by increasing the other. So you typically get curves like these, and if one curve is above another, that detector is better. The ideal point is this one, with zero false positives and one hundred percent true positives, and if your detector is on the diagonal, it is no better than random guessing.
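Here is a minimal sketch of that apples-to-apples setup: the same random-forest classifier trained on the same samples with two feature subsets of different sizes, compared on the ROC metrics mentioned above. The data here is synthetic; in the actual evaluation the columns would be binary indicators for permissions, API calls, and intents.

```python
# Minimal sketch: same classifier, same samples, two feature subsets.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score, roc_curve
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, n_features=600, n_informative=40, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

feature_sets = {"all 600 synthetic columns": slice(None),
                "first 195 synthetic columns": slice(0, 195)}
for name, cols in feature_sets.items():
    clf = RandomForestClassifier(n_estimators=200, random_state=0)
    clf.fit(X_tr[:, cols], y_tr)
    scores = clf.predict_proba(X_te[:, cols])[:, 1]
    fpr, tpr, _ = roc_curve(y_te, scores)
    # true-positive rate at roughly 1% false positives, as reported in the talk
    tpr_at_1pct = tpr[np.searchsorted(fpr, 0.01)]
    print(f"{name}: AUC={roc_auc_score(y_te, scores):.3f}, TPR@1%FPR={tpr_at_1pct:.3f}")
```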
So this is Drebin; like I said, it is pretty good, which is why I am zooming in on the upper-left corner. And this is FeatureSmith; it looks almost the same, almost the same performance. In fact, in security papers we tend to report what happens at a one percent false positive rate, and there the two have exactly the same performance: 92.5% true positives. That is despite the fact that Drebin uses its huge feature set, while the FeatureSmith classifier uses only 195 features.

Another thing we thought might be interesting to investigate is how this knowledge evolves over time, so we trained classifiers using features extracted only from papers published before 2012, before 2013, 2014, and 2015. The performance can only increase, because it is cumulative: each step takes into account the previous papers. It is interesting to see that there is a big jump in effectiveness in 2013, and then we start seeing diminishing returns. When I show this, people ask me whether it implies that researchers are publishing useless work; it doesn't quite say that, and here is why. This is the detection performance for a fixed problem: we used the same set of malware samples, which I believe were collected up to 2012, so this is the detection performance for 2012 malware. For a fixed problem, it is not that surprising that over time you learn everything there is to know about it and you get diminishing returns. It may be that the later papers looked at newer malware behaviors, that they were more focused on the types of malware that appeared after 2012; I think that would be an interesting hypothesis, but we don't have a corpus to test it.

I also told you that FeatureSmith outputs a ranking of the features according to how relevant they are to malware detection, based on the literature it analyzed. We wanted to see how good this ranking is, so we compared it with a term-frequency (TF) ranking; this is a common metric used in information retrieval, and it basically looks at how often each feature is mentioned in the research papers. This plot shows the cumulative mutual information: we add the mutual information of each feature on top of the previous ones, and you can see that we always get more mutual information from the FeatureSmith ranking than from the term-frequency ranking. We also compared both rankings with the actual ranking based on mutual information, that is, how useful the features were for separating malware and benign samples in our ground truth: how much each feature reduces the uncertainty about the class of the sample. Then we ran a rank-correlation test.
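Here is a minimal sketch of that kind of comparison: mutual information between each feature and the benign/malicious label as the ground-truth utility, and a Spearman rank-correlation test against a literature-derived score. Everything here is synthetic stand-in data, not our actual features or rankings.

```python
# Minimal sketch: feature utility (mutual information) vs. a literature-derived ranking.
import numpy as np
from scipy.stats import spearmanr
from sklearn.feature_selection import mutual_info_classif

rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(1000, 20))      # binary feature matrix (apps x features)
y = (X[:, 0] | X[:, 3]).astype(int)          # synthetic benign/malicious labels

# How much does each feature reduce uncertainty about the label?
mi = mutual_info_classif(X, y, discrete_features=True, random_state=0)

# Stand-in for literature-derived feature weights (e.g., FeatureSmith scores).
literature_scores = mi + rng.normal(0.0, 0.05, size=mi.shape)

rho, p_value = spearmanr(mi, literature_scores)   # rank-correlation test
print(f"Spearman rho = {rho:.2f}, p = {p_value:.3g}")
```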
For the TF ranking, we found no statistically significant correlation; but for the FeatureSmith ranking, we did find a statistically significant correlation between the ranking based on the literature and the ranking we observed on our malware dataset.

We also have some false positives, but after we looked into them, it turned out that many of them actually have at least one detection on VirusTotal, so they may not be that benign after all; and some of them are true false positives, but those are security apps or parental-supervision apps, which do a lot of the same things malware does.

I also told you that FeatureSmith can help with generating explanations for machine learning outputs. Typically, work on explainable machine learning focuses on identifying which features contributed to each prediction. But sometimes these features are still a bit too ambiguous: they don't indicate the semantics behind the prediction. Let's say a sample was detected because it invokes the function getNetworkOperatorName; what does that mean? An analyst seeing this may want to see more information. What we can do is take our semantic network and traverse it backwards, starting from the feature and going back to the behavior nodes: this API call is linked to a behavior node that says "send the network operator name to a malicious server", and we can also present snippets from the papers, the actual sentences from which we extracted this connection, along with a reference to the paper itself. Presented with this analysis, an analyst can make sense of it: this is suspicious because it reads the network operator name and may send it to a malicious server. Basically, it presents the semantics behind the concrete feature the detector used.

Before I conclude, I want to mention the alternatives to feature engineering. First, we could do feature selection. This works in situations where you can enumerate all the possible features in advance; say you have the list of all the Android permissions, and then you compute the mutual information, or some other measure of feature utility, for each of them. The problem is that this works for the common attack patterns but ignores the uncommon ones. Representation learning is a way to discover useful features from raw data; this is what neural networks do, so again it avoids feature engineering; but the challenge is that you won't get the complex, domain-specific features I mentioned at the beginning of my talk. A general disadvantage of both approaches is that they are data driven: they tell you how useful the features are for detecting malware on a specific ground truth, a specific dataset, and they may reflect the biases in that ground truth. If the dataset is old and does not contain new behaviors, the features that represent those new malware behaviors will not be selected. And neither approach discovers the threat semantics automatically. [Question from the audience.] Yes, that is a fair point; that is the same kind of disadvantage, and our corpus is a little bit old.
So for follow-on work we are looking at blogs, and not just blogs: if you think about it, the papers are the voice of the defenders, so we are also looking at underground forums and Pastebin, things actually used by attackers; that is the voice of the attackers, and it is interesting to see what they describe as their problems and how they solve those problems in bypassing existing security defenses. [Question from the audience about features for which no behavior is found.] Right, that is true: if we do not find good behaviors, then we might not be able to identify the meaning of those features. A more interesting question is why this works at all, because there is a limited number of features you can put in a paper or a blog post; even the Drebin paper did not list all five hundred thousand features they have. But it turns out that when authors highlight features, they tend to start from the best ones, the ones they can explain, and apparently that is enough. This was not at all obvious when we started the project.

Today many people are worried that automation and artificial intelligence will end up killing lots of jobs, so I just want to say that I don't think this technique will put data scientists out of a job. There is a labor deficit projected, but more importantly, this does not do the same thing a data scientist does; it complements what data scientists do, it does not replace them. Specifically, human data scientists have intuition, which is not something we have figured out how to simulate, while FeatureSmith can reason over a huge body of knowledge; so data scientists can use FeatureSmith as a tool for discovering useful features.

More importantly, I think this illustrates the broader promise of techniques for automating the discovery of threat semantics. In the few minutes I have left, I want to give you a vision of what automatic discovery could allow us to do, and I am going to mention four potential directions.

One of them is explaining the outputs of statistical detectors. In general, when we use machine learning in security, many of these detectors are black boxes: they say "I detected that this is malware," but they don't explain why, and there is work showing that this has actually slowed down the adoption of machine learning in the security industry, because security analysts are uncomfortable adopting something when they don't know how it works; you don't know whether the detection was based on some meaningful feature or on some artifact of the dataset. FeatureSmith allows us to link the features to the behaviors from the papers, which brings some semantic meaning back to the features. In the example I gave you, if a malware sample is detected because it uses getNetworkOperatorName, you get a snippet of text that explains how malware likes to use this API call. So the question here is: can we use this style of explanation to bridge the semantic gap between statistical inferences and the mental models of security analysts, by providing explanations in a colloquial language, the way an analyst would describe why this feature was used in the detection?
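As a rough sketch of what such an explanation generator might look like, here is a traversal of the semantic network backwards, from a concrete feature to the behaviors and supporting paper snippets linked to it. The behavior name, snippet text, and source name are invented for illustration, not taken from our actual network.

```python
# Minimal sketch: generate human-readable explanations from the semantic network.
import networkx as nx

G = nx.DiGraph()
G.add_edge("send network operator name to malicious server", "getNetworkOperatorName",
           weight=0.7,
           snippet="... the Trojan sends the network operator name to a remote server ...",
           source="example-paper-2013.pdf")  # hypothetical supporting paper

def explain(graph, feature):
    """Return explanation lines for why a feature is considered suspicious."""
    lines = []
    for behavior in graph.predecessors(feature):
        edge = graph[behavior][feature]
        lines.append(f"{feature} is linked to the behavior '{behavior}' "
                     f"(weight {edge['weight']:.2f}); supporting text: "
                     f"\"{edge['snippet']}\" [{edge['source']}]")
    return lines

for line in explain(G, "getNetworkOperatorName"):
    print(line)
```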
So that is one potential implication. Another is discovering new threats. When I present this work, somebody invariably asks: what you do is just mine existing, known features, right? These are already in the papers, so why is this new? They are already known. But the semantic network basically helps us connect the dots between concepts that sometimes appear in different documents. In particular, the new features I told you about, the ones missing from the Drebin feature set, actually came from papers that talked about privacy, so they were not necessarily papers that the malware-detection community would be interested in; that knowledge may simply be ignored because of that. More importantly, the links between different concepts may themselves be important. This came up especially when we tried to expand our work to mining sites popular with hackers, trying to understand the semantics of these malicious activities.

For example, suppose you want to characterize malware campaigns. Nowadays there are a lot of threat intelligence companies that provide feeds of indicators of compromise, IOCs, corresponding to different threat actors, to the Russians or the Chinese or whoever. But these IOCs are just indicators: things like IP addresses, hashes, or domain names, and you don't know exactly what they are; in particular, you don't know what role they play in the campaign. A domain name, for example, can be the site that performs phishing, or a command-and-control node, or the site that hosts an exploit kit, and these are very different stages of an attack. By mining natural-language text to figure out the role of an indicator in a campaign, we can say that the same campaign used this domain and then a different domain, so the two domains are probably behind the same attack pattern. The same goes for exploit weaponization: the exploits seen in the wild are different from the proof-of-concept exploits included in, for example, Exploit-DB, because attackers have to add things like platform fingerprinting, improve the robustness of the exploits, and add more malicious payloads. Understanding the purpose of all these code changes can give us a glimpse into the tasks the attackers are trying to achieve, the challenges they face, and how they solved them.

Prediction is another interesting direction. It is somewhat counter-intuitive: how can you predict security events, when in security we are dealing with an intelligent adversary who will do exactly the opposite of what we want him to do? But it turns out that today malicious code has a lot of specialized components, and coming up with weaponized malware code, with an exploit, is usually beyond the skills of a single actor.
So typically these are developed in a collaborative fashion; there is a lot of discussion on these forums about how to weaponize exploits, for example, and we can analyze this discourse to forecast attacks. I briefly mentioned our work on predicting which vulnerabilities are going to be exploited by mining Twitter; in that case we had a two-day median lead time of detection compared to the creation of signatures for detecting those exploits. An interesting insight there is that we actually need both classes of features: the features we extract from Twitter give us high precision and low recall, while the features we extract from the vulnerability characteristics themselves give us high recall and low precision, so we need both kinds of features to make a good prediction.

Another potential direction is generating exploits and cyber deceptions automatically. So far I have only talked about understanding and predicting things, but recently there have been a lot of advances in automatic vulnerability discovery and exploit generation, for example in the Cyber Genome project. These techniques are very powerful, but the existing ones focus on memory-corruption exploits. So the question here is: if we mine the underground discourse and understand what the challenges are for the attackers and how they solve those challenges, can we generate exploits for a broader class of vulnerabilities? Or can we generate effective cyber deceptions, for example by making it look like a system is vulnerable to a specific attack, a specific exploit, when it actually is not, with the purpose of thwarting platform fingerprinting?

So I am going to wrap up now. I told you about our semantic network design, which is a flexible representation of security knowledge, and our system can discover open-ended malware behaviors. I also described automatic feature engineering, a method for discovering semantically meaningful features, some of which were missing from a manually curated state-of-the-art feature set; and the performance of the automatically engineered feature set is comparable to that of a state-of-the-art detector. We have released our semantic network and all the features we engineered on our website; you can check it out, and we hope that other people will build on our work. Before I conclude, I just want to leave you with two thoughts. The first is that automated systems can understand the semantics of security concepts. The second is that this is a powerful tool for creating both attacks and defenses. Thank you for your attention.

[Question from the audience.] Yes, we experimented with groups of papers published before 2012, before 2013, and so on. I don't know the answer to that particular question offhand; we can look into it. [Question from the audience.] Right. So in our automatically engineered feature set we had features that did not show up in the malware ground truth at all, and we were curious, so we looked into what these things are. One of them was isMusicActive, an API call from Android, and we wondered why this would be useful for malware. It turns out there is a paper showing that it can actually be used as a side channel to leak location information if you are driving while listening to music.
Google Maps will interrupt the music whenever it tells you to turn left or right, and it turns out that just knowing the timing of these interruptions, if you have enough of them, is probably enough to locate somebody. No malware uses this in the wild yet, but that doesn't mean they can't. This goes back to the data-driven disadvantage I mentioned: with our approach you can actually find features that are not present in the data, features that are useful according to what the security community thinks, not according to one particular dataset.

[Question from the audience about how we built the paper corpus.] We built it ourselves. We started with some conference proceedings, as many as we could find, and ultimately we signed an agreement with the IEEE, and they give us a feed of all the papers published at IEEE venues, on a weekly basis, so that we can mine them. Since you brought this up, I also want to mention that this kind of literature-based discovery is established in the biomedical field, and the benefit there is that they can usually get away with analyzing just the abstracts, from PubMed for example, because in that field people publish structured abstracts with a very strict structure. In security, and in computer science in general, we don't tend to publish structured abstracts, so we actually need to parse the entire paper, and there are some complicated issues related to parsing PDFs, such as pulling information out of tables; we had to take care of those issues as well.

[Question from the audience about applying this to other domains.] It is a good idea; we can probably apply this to many different problems, many different types of literature. One suggestion I heard is mining protocol specifications, RFCs for example, and why not mine legal documents, court documents? Sure, I think it is an interesting direction to pursue.

[Question from the audience about weakly connected features.] Right: if there is no connection at all, then we say the feature is not related; if there is a connection, but it appears in only one paper, then we would say it is connected, but it would be lower in the ranking we end up with.

[Question from the audience.] That is a very interesting question. The work I described here is a one-shot thing: we analyzed this corpus of papers and we published our data, so everybody can get it. In our follow-on work on mining underground sites and identifying the roles of, for example, the IOCs in a campaign, we are actually looking at setting up a web site that is updated periodically; I think it is actually live, I don't have the link here, but we have already set it up. That work is still a bit ongoing: we can identify the roles pretty well for a specific type of campaign, a malware distribution campaign, but we want to see if we can generalize this to other types of threats and campaigns.
I think the work we are doing on exploit weaponization is also relevant here. The way we got to it is the observation that sometimes someone will tweet a short code snippet that later ends up in an exploit. This was the case, for example, with the exploit used in the 2011 attack against RSA, which reportedly resulted in stealing seeds for the SecurID tokens, so a lot of tokens had to be replaced at the time. It turned out that a very specific code snippet was tweeted maybe a month or a few weeks before the attack, before the spear-phishing email was sent to RSA. Brian Krebs has a very nice article about who the person who tweeted it was, and why. So we believe these sorts of things are out there, and this is what we are trying to do in the weaponization project: figure out the purpose of these communications, of the code snippets the attackers exchange and modify, and why they are modifying them. That is ongoing work.

[Question from the audience.] I said that the work in this particular paper was a one-off analysis, but that is not how I approach my work in general. I used to work for Symantec, and my job at Symantec was to build the WINE platform, which shared data with researchers in academia; it was basically one of the earliest threat intelligence platforms. The point there was that it was an ongoing process: we obviously didn't just dump a one-time dataset, it was something people could use on an ongoing basis, and it actually had a pretty nice impact in terms of the papers that were published based on that data. So that was my experience before going back to academia, and as an academic I actually strive to build things that last.