This week it is my absolute honor to introduce William Robertson from Northeastern University. He has been doing incredible work in the space of bug hunting, software security, and everything that falls under that umbrella. So I'm going to let him take it away. I'm very excited for this talk. Go ahead, Will.

Thank you for that very warm introduction, I very much appreciate it, and thanks everyone for coming out and giving me the chance to tell you about some of the stuff I've been up to recently. Before we jump into it: if you bothered to read the abstract, well, thank you for one thing, but I've also decided to call a small audible. When I actually sat down to write this talk, it turned out that I had a lot to say about the first couple of things I mentioned in the abstract about bug hunting. So I'm going to focus on that aspect of my recent work, but I'd love to chat with you more about some of the other stuff I've been up to with modern privilege separation, if we get the chance. So let's dive in.

All right, there we go. In looking back over some of my recent research efforts to find a common thread that knits things together, I kept coming back to the bug. As a security researcher, bugs and the resulting vulnerabilities are a mortal enemy of sorts, or so the conventional wisdom goes. They're everywhere. They lie dormant in all of our software, ready to be exploited by adversaries who invest considerable resources to turn them to their advantage. Defenders, on the other hand, are forced to invest considerable resources to find, remove, or mitigate exploitation of those bugs. That ranges from developing new programming environments that try to prevent you from even being able to express these bugs, to using formal methods to prove the absence of bugs with respect to a specification.
Or using static program analysis to hunt down known classes of bugs, or perhaps using property-based testing to check whether programs abide by certain security-relevant properties, and thus argue that certain bugs can't be present in your software. Much of my own research career has, at the end of the day, focused on dealing with bugs in one way, shape, or form. For example, I've worked on hardening web applications using learned invariants. I've looked at finding and preventing code-reuse vulnerabilities in browser extensions. I've developed static and hybrid analyses for hidden backdoor detection in Android applications. I've done quite a bit of work on the web, for instance crawling the web to find vulnerable JavaScript at scale, and recent work at USENIX looking at web cache deception and the failures that can arise when you take a systems-level look at distributed applications, even if the individual pieces are themselves considered secure. And I've also done some work on fuzzing for algorithmic denial-of-service vulnerabilities. I'm going to focus on fuzzing for the rest of this talk. I did want to quickly mention that in the last couple of years I've been spending quite a bit of time on other topics as well, and I'd love to chat with any of you about these if they interest you too. For instance, hardware/software co-design on top of RISC-V to build more robust defenses against exploitation, with some folks at BU. I have a long-running interest in embedded security, and recently that's taken the form of rehosting embedded software: transplanting it into analysis environments where you can actually throw all of the best-practice bug-hunting tools we have at it. That's with NYU and MIT Lincoln Laboratory.
Or thinking about how to express and enforce security policy throughout the design, implementation, and deployment lifecycle of applications, and some of that work is thankfully starting to come to fruition, although with COVID there are always challenges. But despite my efforts, and the efforts of all the other really smart people working on this problem 24/7, there's still this huge deluge of software defects. So how are we supposed to achieve any sort of security when we know that bugs are everywhere? Well, I'm not going to stand up here and present some silver bullet and claim there's a solution we can adopt tomorrow to the problem of bugs. But I would like to tell you about a couple of recent efforts to better our understanding of bugs, and hopefully to build on that to develop better techniques for finding them. I'm first going to talk about synthetic bugs, and why bug injection is a promising means for improving security and getting a better understanding of organic bugs, as I'll call them to differentiate the two in this talk. Then I'll talk about some recent work that builds on these insights to better find real bugs.

So, let's start with synthetic bugs. Back in 2016, which feels like forever ago now, I, along with some collaborators at NYU and Lincoln, published some work at Oakland called LAVA. LAVA stands for Large-scale Automated Vulnerability Addition, which is a pretty apt description of what we're trying to do: generate many bugs and inject them into software. Which seems like the opposite of what you might want to do. Why in the world would you want to do this? Well, it turns out there's actually a very good reason. As scientists,
we'd like to be able to do things like replicate results on standard datasets and perform fair comparative evaluations of bug-finding approaches, and in particular of fuzz testing, because this is, I think it's safe to say, the most effective and widely used approach we have for finding bugs in an automated fashion. But that's difficult without a representative benchmark suite that, crucially, comes supplied with ground truth, where you can take a fuzzer, throw it at one of these programs, and hopefully at some point trigger one of these bugs. With that ground truth, you're able to accurately figure out which bug you've triggered and how, and you know that these bugs are triggerable. There are also some nice properties you get from the ability to generate large sets of ground truth. Without this, we're left with alternative metrics that in one way or another are not completely satisfying. One class of these is the various forms of code coverage, which are extremely useful, obviously, but are a proxy measurement for what we actually care about, which is finding bugs. Another thing you can do, of course, is simply tally up the number of bugs you discover in an experiment, but that suffers from triage issues and is also, unfortunately, not really a trustworthy metric. This last point, as a side note, is a well-established idea in the software engineering community, as was recently pointed out in the blog post that I previewed for you, published by Andreas Zeller, where he talks about his experience being a reviewer on several PCs in the software engineering and security communities. It's a great read and I highly recommend skimming through it; I'd say my experience matches a lot of what he talks about there.
In particular, security researchers tend to love finding bugs, and we go absolutely bonkers for 0-days, myself included. So if you have a paper that can claim it found new bugs, even for small n, like one or two new findings, that's highly predictive of acceptance. What often gets overlooked are the confounding factors in the experimental setup. Just because you slightly tweak one of AFL's mutation operators, throw it at some program that's never been fuzzed before, and find new bugs, does that mean you're suddenly at the forefront of bug hunting? Probably not.

Now, these issues are hopefully well known, and LAVA is kind of old news at this point, but I bring it up precisely because since its publication it's become widely used as a fuzzing benchmark, in particular LAVA-M. On the one hand, this is highly gratifying; it's a nice impact to have on the community. But it also didn't go exactly as we hoped. For one thing, we had originally intended for LAVA users to generate fresh bug corpora on demand, which would have had the benefit that as we improved LAVA, you'd get improvements to the benchmarks over time. What actually happened is that everyone just took LAVA-M and ran with it, which is completely understandable in retrospect, because that's the path of least resistance. But it also has the bad effect of magnifying flaws in what is now a standard benchmark. The second phenomenon that became clear over time is that fuzzers were consistently reporting just excellent performance on LAVA-M, to the point that you really have to start to suspect whether this benchmark has any real utility in discriminating between different fuzzers. Setting a low bar wasn't really what we were aiming for when we published LAVA.
That led to a larger question we found scientifically interesting: is synthetic bug generation an ecologically valid approach to fuzzer evaluation? That is, if you evaluate a fuzzer with synthetic bugs, do those results generalize to real-world performance? As experimental computer scientists, we decided to put this question to the test. I'm going to preface the rest of this talk by stressing that this is very much not just my work. This was the combined work of people at several institutions, including Northeastern, NYU, and Lincoln Laboratory: Brendan Dolan-Gavitt at NYU, Tim Leek, a staff scientist at Lincoln, and in particular my PhD student Josh Bundt, who also happens to be a Lieutenant Colonel in the Army, a rather imposing guy who's now at the Army Cyber Institute at West Point. He led this work, and really all the credit should go to him, but I'll try to answer your hard questions as well.

All right. To try to answer this question, we devised a set of experiments around three research questions. First, what can synthetic bugs tell us about the relative performance between fuzzers? Second, are some bug injection techniques better at producing harder bugs than others? And finally, how difficult are these synthetic bugs to discover compared to organic bugs, which are, after all, what we're trying to model?

To follow along, let's quickly cover some details of how LAVA works, at least first-generation LAVA, and what these bugs look like. In order to satisfy the design goal of linking each injected bug to a witness, we built LAVA on top of a dynamic analysis platform called PANDA, which is really Brendan's brainchild: a version of QEMU that supports whole-system dynamic taint analysis and record/replay. It's very cool stuff that should be used more often, I think.
So what LAVA does is, given some seed input, it records an execution trace while tracking taint for all the bytes in that seed input, and it identifies bytes that are so-called DUAs: dead, uncomplicated, and available at some point along that execution trace. I realize that's a little hard to see, but there's a little blob of input-derived data that appears in the program as it computes forward in time. These DUA bytes are not influencing control flow, and they haven't been modified much; we have a metric to quantify this, and I'm not going to get into the details of that. At some point where the DUA is available, we inject code that siphons it off into a global copy added to the program. Then, later in the execution, we select some attack points where DUA values could be used to make the program vulnerable. These attack points are instrumented with code that performs a check against some magic constant and, if that condition is satisfied, blows up the program. This achieves some of the design goals we were looking for, and we've improved on it in several ways, with multiple non-contiguous sets of bytes, more difficult path constraints, and so on. I'm not going to spend too much time on that; we're mostly going to focus on this first-gen, naive version, because this is what people are primarily using. And if you look at this, maybe I'll switch to this version that is a little more realistic, if you squint at the code here, you can see that these bugs take a very particular form. They're fairly noisy, if you think about it in a detection sense. And in particular, the use of an equality comparison against a constant makes them quite easy for a fuzzer to solve.
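To make the mechanism concrete, here is a toy Python model of a first-generation injected bug. This is not LAVA's actual C instrumentation; all names, offsets, and the magic value are hypothetical. It just captures the shape described above: a DUA byte range from the input is siphoned into a global at one program point, and a later attack point compares it against a magic constant and crashes on a match.

```python
# Toy model (hypothetical, not LAVA's real output) of a first-gen
# LAVA bug: siphon a DUA into a global, check it at an attack point.

LAVA_MAGIC = 0x6C617661  # illustrative 4-byte trigger value

dua_global = 0  # global copy of the siphoned DUA bytes


def parse_header(data: bytes) -> None:
    """Stand-in for the siphon point: bytes 8..12 of the input are
    tainted, dead (don't influence control flow), and uncomplicated,
    so the injected code copies them off for later use."""
    global dua_global
    dua_global = int.from_bytes(data[8:12], "little")


def process_record() -> str:
    """Stand-in for the attack point: the injected guard fires only
    when the siphoned DUA equals the magic constant."""
    if dua_global == LAVA_MAGIC:
        raise MemoryError("injected bug triggered (simulated crash)")
    return "ok"


def run(data: bytes) -> str:
    parse_header(data)
    return process_record()
```

Any input whose bytes 8..12 are not the magic value behaves normally; the one input that matches "crashes", which is exactly the witness property LAVA wants, and exactly why a single wide equality check is easy prey for dictionaries or concolic solvers.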
In particular, if you integrate a concolic execution engine and do some hybrid fuzzing, then this is fairly trivial, but even the use of dictionaries, extracting literals from source code or from a disassembly, will let you easily cover these LAVA edges. Comparison splitting is yet another technique that can easily deal with this sort of thing, as can program transformation techniques like T-Fuzz. So this is definitely a weakness, and it's something we were initially worried about with respect to synthetic bug difficulty and realism, but it actually turns out to be only part of the story, and not really the crux of it. In any case, to start down this path of quantifying how effective synthetic bugs are for fuzzer evaluations, we had to construct a couple of datasets, so we needed some benchmark programs. We luckily had been running a public fuzzing competition, open to industry and academia, called Rode0day, and had a pretty much ready-to-go set of challenge programs to use as benchmarks. So that was not a problem to spin up. The next step was, however, a little trickier: in addition to this, you need a set of fuzzers to test, and there's obviously a lot of interest in fuzzers out there. This is reflected in the number of fuzzing papers, in the number of systems that are actually released as source code you can download and run, and in submitted papers; I can attest to that. In one of the last review cycles for USENIX Security, a full third of the papers I had to bid on had fuzzing in the title, which is a little bit ridiculous. So narrowing this field down to some representative state of the art is not an easy feat. But thankfully, at the time we were doing this, there was a fledgling concurrent effort by Google to spin up fair comparisons between fuzzers, called Google FuzzBench.
This turned out to be extremely useful, because you can look at that and say, okay, this seems to be a representative set of the state of the art. We had some other considerations, like wanting to avoid biasing too much towards AFL and its lineage, which is admittedly not all that easy to do, but we ended up with a set of fuzzers that represents a range of techniques. You have some hybrid techniques that use concolic execution; you have some non-AFL variants like Honggfuzz; you have some that try to do gradient-descent approximation of inputs and link input bytes to program state; and AFL++, which is actually quite interesting because it integrates a lot of techniques from across a range of academic and industry fuzzing contributions.

So that's great, and now we have to collect some data. Our benchmarks contain both injected bugs and known organic bugs. We wanted to not only count bugs, which we could fairly easily do (with some additional work for the organic bugs); we also wanted to be able to measure how close each fuzzer got to each of these different bugs. To make those determinations, we had to turn to the question of coverage measurement. Hopefully everyone's familiar with block and edge coverage, which is the baseline standard in this space; any coverage-guided fuzzer is going to be doing something along those lines. But it turns out that all of these fuzzers have slightly different variations on how they actually define this. So we made an executive decision and just went with one: we picked Redqueen's QEMU-based binary coverage tool and ran it concurrently with the fuzzer executions.
What this gave us was a very nice ordered stream of block and edge coverage over time for each input, which is a massive amount of data, but great for doing some data mining. In the end, we had a fairly large-scale experiment, so we had to split it over three high-performance computing environments, at Northeastern, West Point, and MIT SuperCloud. In all, the experiments consumed about 733,000 CPU hours, or a little over 80 CPU years. We had about 60 million test case inputs, roughly 225,000 unique edges recorded in our coverage database, and 7.3 million crashing inputs for 3,400 unique bugs. So we had a bit of data to crunch.

Which brings us to metrics: we have this data, so how are we going to analyze it and make sense of it all? There are a lot of ways to do this, and not all of them are great. There is hopefully a large awareness in the community by now that fuzzers are fundamentally randomized algorithms, and evaluating randomized algorithms is known to be hard; there's a lot of noise. Arcuri and Briand, whom Klees et al. cite pretty heavily, published recommendations on how to do this in a sound way, but especially in the case of fuzzers, where there are a lot of outliers and a lot of variance, reporting a sound statistical result is super challenging. Just to illustrate this, because I think it's a nice case study: this is one of the coverage distribution plots for the top five fuzzers on one of the benchmarks; I believe this is SQLite. They're ranked from left to right. These are box plots; the white triangle represents the mean, whereas the gray line across the middle of each box, roughly speaking, represents the median. And you can already see that depending on the measure of centrality you choose, you get different results.
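As an aside, here is a minimal sketch of the kind of mining that ordered coverage stream enables. The data layout is hypothetical (the talk doesn't specify the schema): each trial is an ordered list of (timestamp, covered-edge-set) records, from which we can recover each edge's discovery time per trial and its median discovery time across trials.

```python
# Sketch: per-edge discovery times from an ordered coverage stream.
# The (seconds, edge-set) record layout is an assumption for
# illustration, not the actual experiment's schema.
from collections import defaultdict
from statistics import median


def edge_discovery_times(trial):
    """trial: ordered list of (seconds_since_start, set_of_edge_ids).
    Returns {edge: first time it was covered in this trial}."""
    first_seen = {}
    for t, edges in trial:
        for e in edges:
            first_seen.setdefault(e, t)  # keep earliest sighting
    return first_seen


def median_discovery_times(trials):
    """Median discovery time per edge across all trials that found it."""
    per_edge = defaultdict(list)
    for trial in trials:
        for e, t in edge_discovery_times(trial).items():
            per_edge[e].append(t)
    return {e: median(ts) for e, ts in per_edge.items()}
```

This per-edge median discovery time is the kind of quantity the "mean path" analysis later in the talk is built on.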
If you choose the medians to represent the fuzzers, then Honggfuzz, on the left-hand side here, is the best. But if you choose means instead, then AFL is the best. And even where you don't have flipping in the rankings, if you look further down to the right, you can see that the closeness between the different competitors grows or shrinks depending on how you choose to define things. So the message here is that these summary statistics are not accurate and really shouldn't be used. The question, then, is what should you use? And the answer is the Vargha-Delaney Â12 effect size or the Mann-Whitney U test. This is something Klees et al. mention, and most work in the field ends up using Mann-Whitney U. Something that's interesting is that even if you go to the effort of doing things the right way, you can still hit rough edges that bite you. One of these with Mann-Whitney U is that most implementations, including the popular ones, do not implement the so-called exact method; they use an approximation that is inaccurate for sample sizes less than 20, which makes the results simply invalid. And if you look at the field, most papers use far fewer than 20 trials in their experiments. You can then also look at implementation problems in the most popular implementations of Mann-Whitney U: SciPy's was, and currently still is, just broken, but nevertheless that's what people use. So it's just a Wild West out there. And this is not to point fingers; I've made some of these mistakes myself. It's just to get the message across that this stuff is really hard. But hopefully, if we know about these pitfalls, we can do better in the future. All right, enough preliminaries; let's talk about some meat.
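For the curious, the Vargha-Delaney Â12 statistic just mentioned is simple enough to compute from scratch, which sidesteps library quirks entirely. A minimal sketch: Â12 is the probability that a random measurement from fuzzer A beats one from fuzzer B, with ties counting half; 0.5 means no difference.

```python
# Vargha-Delaney A12 effect size, computed directly from its
# definition (pairwise comparison), with ties counted as 0.5.
def a12(a, b):
    """a, b: sequences of measurements (e.g., edges covered per trial).
    Returns P(random draw from a > random draw from b), ties half."""
    wins = sum(
        1.0 if x > y else 0.5 if x == y else 0.0
        for x in a
        for y in b
    )
    return wins / (len(a) * len(b))
```

By convention, values around 0.56, 0.64, and 0.71 are often read as small, medium, and large effects. Note this is the effect size; it complements rather than replaces a significance test like Mann-Whitney U.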
This table shows a summary of our experimental results in terms of these metrics. We report the first- and second-best performers on each benchmark challenge, both for edge coverage and for number of bugs found. Â12 is given for the fuzzer pairs; bold Â12 values indicate that the distributions were not considered equal under that measure. Dark green indicates Mann-Whitney U's strength of rejection of the null hypothesis, and bolded fuzzer names indicate that the fuzzer was top-ranked in both edge coverage and number of bugs found. So what do we learn from this? Well, you can see that QSYM actually does quite well. In fact, if we had to declare a winner, QSYM would be it. And hopefully QSYM is familiar; this is work from Taesoo's group, whom I'm sure you all know. But to be fair, it's not a clear victory, and the results do vary depending on the target, which raises interesting questions: why were some targets hard for QSYM but not others? I think there are a lot of interesting insights to be drawn from the variations you see here, but unfortunately we can only go into so much today, so let me look at some other ways of slicing this data. This takes a sample of ten representative trials and shows the percentage of bugs found across all of the challenges in a couple of different ways: the total, versus the average per target in orange, versus the 90th percentile in green. What you can see here is that, again, QSYM does quite well; it's top-ranked in all the categories. But other fuzzers have very skewed performance. Fuzzers like AFL and Angora have per-target discovery rates that are higher than their total discovery rates: they did really well on some targets and just fell flat on their faces on others. Other fuzzers, like Honggfuzz, are highly inconsistent.
In Honggfuzz's case, its best performance was quite good, but its 90th percentile is very far away from that. Consistency is something you really want if you're running what are essentially randomized trials over and over in your fuzzing campaign.

Let's look at one example bug, which I think is illustrative of the nuances of comparing these fuzzers. This shows box plots of the minimum discovery time for a particular bug, for four fuzzers across the trials. Note the x-axis is a log scale, scaling out to the longest experiment we did; we actually ran a number of experiments, with nowhere near enough time to talk about them all, but the longest ran for seven days. Here Angora, which is Hao Chen's work out at Davis, had the best minimum discovery time, but it also did not have the highest n. It wasn't consistent, but when it worked, it worked fairly well. Whereas Eclipser and QSYM, which both use a variant or approximation of concolic execution, did better in terms of consistency: they found the bug more often, but they also needed longer to do so in order to solve the path constraints. So it's really not clear-cut. If you looked at this, you might assume Angora is the best, but personally I would hope for consistency if you're going to do fuzzing for the long haul.

So let's jump ahead to the question of relative discovery difficulty between synthetic and organic bugs, because this gets to the crux of what we're trying to figure out: are these things actually useful? One way we can approach this is to compare the median discovery times for each of these classes of bugs. Now, we don't know the full set of real bugs in the corpus; that's a chicken-and-egg problem we have in dealing with all this.
But we do know of some n-day bugs: people have already found bugs in the benchmarks that we used. We knew of about 50 different instances, and we were able to collect 35 actual PoCs. So, with all of this compute and all of this instrumentation and all of this careful experimental design, how many of these bugs do you think we found? Well, you might guess that it's not what you would want or expect: it was actually zero. To me this was really surprising, if you think about it. We invested hundreds of thousands of CPU hours in these experiments; many bugs today are found with fuzzers, and these are supposed to be state-of-the-art fuzzers. In fact, we know that at least two of the real bugs were found by fuzzers as part of an academic evaluation. So forgive my language, but what the ****, right?

Here's what we can say. It turns out that the majority of these 50 real bugs required ASAN, and ASAN turns out to be really critical for finding real bugs, although it also creates a lot of noise in your bug discovery rate, because you'll find a lot of stuff that frankly is not exploitable or not interesting. Another reason is that many bugs, for instance 17 in the version of tcpdump that we used, required a very particular environment to be reachable, in particular certain command-line arguments to be set. And most, perhaps all, whole-program fuzzers don't attempt to explore that space at all; it's typically something you just pick at the start of your experiment. But as we can see from this, it turns out to be extremely important, and we'll come back to that later on.

So, rather than just measuring a difference in bug discovery, we really want to find out how close, more precisely, our fuzzers under test came to finding these n-days. To do this, we tried to characterize distance in terms of basic block coverage.
Given the scale of the data, this seemed like a reasonable way to summarize the distances. We took our PoCs, ran them through our coverage-collecting framework, and compared that to the coverage database we had, and then computed these Venn diagrams I have up here for each of SQLite, tcpdump, and file. These are pretty easy to read: the red circles are the unique blocks found by the fuzzers, the green circles are the unique blocks covered by the PoCs, and the intersection is yellow. There are only three instances here, but they tell really different stories. On the one hand, on the left, you have a large proportion, 34 percent, of PoC blocks that were not covered at all by any fuzzer; all of that compute just didn't even get close. That makes some sense if you think about the large number of path constraints SQLite has; SQL is a very regular, structured format. But that may not be the whole story: SQLite has also been continuously fuzzed for years as part of Google's OSS-Fuzz, so it's difficult to say whether that is really the reason. Really interestingly, I think, file B3 illustrates the other side of the spectrum: we got really close. Comparatively speaking, we missed by only 13 blocks out of 1,800. And we know that two of those bugs would have been easily triggerable had we covered them; they didn't require ASAN or anything like that. So why didn't we cover them? To try to understand this a little better, we decided to measure the distance more directly between fuzzer coverage and synthetic bugs, in comparison to these organic bugs. And given the scale of the data, we again tried to come up with a way to characterize this quantitatively, which we call the mean path, as a sort of stand-in for easy-to-cover code.
Specifically, we define this mean path as the edge coverage you achieve where the constituent edges have a median discovery time of less than one hour over our full experiment; the intuition is that they're found relatively quickly. Of course, the mean path is not solely a function of the program, but rather of the program, seed input, and fuzzer together. But once we can compute this mean path, we can approximate the distance between it and the bugs, and this is what you end up with, under some assumptions. What you're looking at here is the CDF of the shortest path in the interprocedural control flow graph between the mean path and the synthetic bugs, for the three benchmarks we know also contain real bugs. That is obviously an approximation, because the shortest path is not necessarily a feasible path. But we still think this is very instructive, because the distribution is heavily skewed towards very short paths: in fact, a full 85 percent of all synthetic bugs were one edge or less from the mean path for each benchmark. SQLite shows up with a long tail that curves off to the right, which makes sense given the number of PoC blocks that were never reached by a fuzzer. You can contrast this with a manual analysis of file B3, which I don't have the figure for here, unfortunately, but which showed, for one thing, that basically every synthetic bug was within one edge or less of covered code. Whereas if you did a manual analysis of the distance represented by those 13 PoC basic blocks I showed you before, the fuzzers would have needed to cover two whole additional functions, so the distance was magnified a bit. Earlier I mentioned that LAVA v1's magic value comparisons were problematic.
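Before going on, the distance computation just described can be sketched as a multi-source breadth-first search over the interprocedural CFG, starting from the set of mean-path blocks and measuring hops to a bug's block. The graph representation here is a hypothetical toy; and as noted above, graph distance is only a lower bound, since the shortest path may be infeasible.

```python
# Sketch: shortest-path distance from "mean path" (quickly covered)
# blocks to a bug's block, via multi-source BFS on a toy ICFG.
# icfg maps each block id to its successor block ids (assumed layout).
from collections import deque


def distance_to_bug(icfg, mean_path_blocks, bug_block):
    """Return hop count from the nearest mean-path block to bug_block,
    or None if the bug is unreachable from covered code in the graph."""
    queue = deque((b, 0) for b in mean_path_blocks)
    seen = set(mean_path_blocks)
    while queue:
        block, dist = queue.popleft()
        if block == bug_block:
            return dist
        for succ in icfg.get(block, []):
            if succ not in seen:
                seen.add(succ)
                queue.append((succ, dist + 1))
    return None
```

Plotting this distance for every bug gives exactly the kind of CDF described above; a bug at distance 0 or 1 sits right next to code the fuzzer already reaches quickly.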
But they weren't really the entire story. The reason is that this difference in distance from covered code actually seems to be the key, and perhaps the critical reason, for the difference in difficulty between current synthetic bugs and their organic counterparts. Why is this? Well, the reason LAVA's synthetic bugs are easy is a function of the way LAVA works: you have this dynamic path, and you inject bugs that are adjacent to that path, essentially, and that short distance makes them easy to find. But as we've now seen empirically, and as work from Marcel Böhme at FSE 2020 shows, there's an exponential increase in difficulty as the distance from covered code to uncovered code grows; a sort of power-law distribution comes into effect. And we see exactly that here: even if you're only talking about 13 basic blocks away, covering them can be an insurmountable distance, even if that doesn't sound very hard.

In any case, the big picture here is that organic bugs tend to be significantly more difficult to discover than synthetic bugs, certainly for these first-gen LAVA bugs. Which raises the question of where to go from here. It would be a very easy conclusion to say, well, okay, let's just use real bugs. And with datasets like Magma out of Mathias Payer's HexHive lab, this is something one could do. I think Magma's approach, which uses real bugs and also comes with some ground truth, has a lot of value, but there's also a lot of value in scale and in the ability to inject many bugs, which I don't really think you can easily match with real bugs. So I think instead it's worth thinking about how to address these issues.
This is especially because, in my opinion, the pursuit of narrowing this distance between what synthetic and organic bugs look like itself broadens our understanding of real bugs and how to better find them in real code. And this is something that I don't think we as a community really have a full handle on. We have very precise ways of expressing how bugs appear in terms of specifications and properties, but in a very narrow sense that excludes how we reach them, or what the diversity and complexity of the data flow involved in triggering them looks like. And I think there are some very interesting directions that could be explored to answer these questions and gain that understanding. One of these is that if we're going to fix synthetic bugs, diversifying the injection points seems really key; that is something we want to be able to do to better mirror what actual real bugs look like. And if we want to do this, then you can think about defining metrics like: what is the complexity of the path constraints you end up with for real bugs, and then trying to tune the difficulty of synthetic bugs against that. There's also been a lot of progress in applying machine learning to some of these tasks. For instance, I'm thinking of some recent work on neural network embeddings for binary code. If you have the ability to easily throw this stuff at a neural network, then it suddenly becomes possible to create a generative model that can maybe just generate bug instances that look like real bugs. I don't currently have a research scientist who wants to work on these things, but I would love to chat about this stuff more, because I think it's important. So maybe now the motivation for the talk's title is a little bit more clear: in the course of carrying out this work, I've maybe gone a little crazy.
But I've gone from viewing bugs as purely an adversary to stamp out, to viewing them as a valuable object of study in their own right; you sort of have to become the bug to defeat the bug. So I'm going to switch from synthetic bugs and talk now about some work in progress on human-guided fuzzing that builds, in some ways directly, on the insights we got out of this large-scale study. One thing we observed pretty immediately about state-of-the-art fuzzer performance was that these fuzzers tend to plateau relatively quickly, roughly around 24 hours, which again dovetails nicely with past experience and with the theoretical formulation Marcel published at FSE. What you end up with, if you imagine this mean path in the context of the program's ICFG, is something like cones of coverage: you start with some initial seed, and a fuzzer, depending on its capabilities and the difficulty of traversing adjacent edges, is able to expand this coverage frontier. For each seed, maybe you get a cone through this very idealized ICFG. But at some point you plateau, and you're not able to cover anything else because the path constraints just get too complex. And I think a lot of fuzzing work really focuses on technical improvements to expand the frontier and delay the plateau. That's very valuable, but I really feel there's a very concentrated focus on this one piece of the end-to-end fuzzing procedure. This can take the form of more sophisticated fuzz configuration scheduling algorithms, or mutation operators and scheduling those more often or more optimally according to some definition, or input region prioritization and linking that to program state, or concolic execution, which, as we saw, can be very effective in the case of QSYM, so long as you don't drag down the execution rate on test inputs.
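As an aside, here is a crude sketch of how an operator might detect the plateau just described from a fuzzer's coverage-over-time log. The log format and the one-hour window are assumptions for illustration, not tied to any particular fuzzer:

```python
def has_plateaued(coverage_log, window=3600):
    """coverage_log: time-sorted (seconds, total_edges_covered) samples,
    e.g. scraped from a fuzzer's stats output. Returns True if the
    trailing stretch with no new coverage is at least `window` long."""
    if not coverage_log:
        return False
    last_increase = coverage_log[0][0]
    prev_cov = coverage_log[0][1]
    for t, cov in coverage_log[1:]:
        if cov > prev_cov:      # new edges appeared at time t
            last_increase = t
        prev_cov = cov
    return coverage_log[-1][0] - last_increase >= window
```

In practice this is the trigger for the human interventions discussed next: once the curve goes flat, pure machine time stops paying off.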
But I want to really stress that this is just one piece of a bigger puzzle, in the sense that when people go out and actually use these fuzzers, they don't just launch the fuzzer and walk away. That sort of automation is one of the selling points we push in fuzzing papers, but fuzzing practitioners, from talking to a lot of them, actually do something a bit different. So the fuzzing practitioner, or fuzzer operator as I'll call them, represented here by our fuzzer wrangler, is involved at other points in this process. What ends up happening is that you run a fuzzer and you get a coverage report. Then you look at that coverage report and you say, all right, there's this piece of the program that we weren't able to reach; it looks like maybe I could select another seed or something that would drive execution down that path. So maybe you modify the seed set and do something smarter to get to a part of the program the fuzzer wasn't able to reach before. Or maybe you see, oh, there's this code that I'm not able to reach; maybe it's because there's a defect in my test harness. Maybe I should add this command-line argument, or maybe I should use a different parameter to the APIs that are in play here. This is actually best practice; it's what people do, whether you're a first-party developer with fuzz testing built into your CI pipeline, or a third party that does this as a public service or for other purposes. And I think academic research hasn't completely ignored these issues. There are papers out there that look at things like ways to automate seed selection, and people acknowledge that these are important parameters. But comparatively speaking, there really is not much, and that's unfortunate.
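The first step of that practitioner loop, spotting unreached code in a coverage report, can be sketched like this. `llvm-cov export` and `-instr-profile` are real LLVM tooling, but the paths are placeholders and the JSON handling is a sketch of the export format as I understand it:

```python
import json
import subprocess

def export_coverage(target_binary, profdata_path):
    """Shell out to llvm-cov's JSON export; target_binary and
    profdata_path are placeholders for your own instrumented build
    and merged profile."""
    proc = subprocess.run(
        ["llvm-cov", "export", target_binary,
         f"-instr-profile={profdata_path}"],
        capture_output=True, text=True, check=True)
    return json.loads(proc.stdout)

def unreached_functions(cov_json):
    """Functions with an entry count of zero: the code the report says
    the fuzzer never reached, and hence candidates for new seeds or
    harness fixes as described above."""
    funcs = cov_json["data"][0]["functions"]
    return sorted(f["name"] for f in funcs if f["count"] == 0)
```

An operator scanning that sorted list for telling names (say, a `docx` parser that never ran) is exactly the manual step the compartment analysis below tries to support.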
So in that context, we wondered whether we could build upon these well-understood pieces and more tightly integrate what we're able to do with our coverage measurement, folding that into these coverage reports in a way that lets the human guide the fuzzer more effectively towards code that is difficult to reach. We call this idea compartment analysis, where we're able to point and say, hey, look here, this is your best bang for the buck; it's a cost-benefit analysis, if you want to think of it that way. And I want to note that we didn't invent the term compartment. To my knowledge, it was coined by Yan and company in their Driller paper back in 2016. But also to our knowledge, there hasn't really been a good definition of what a compartment is, how you rank and quantify them, or any guidance in that respect. I think there's a big gap there that this work helps, in part, to close. So this is what the overall workflow looks like. In addition to collecting standard coverage information, we collect profiling data as well, using LLVM's built-in PGO, profile-guided optimization, framework. That counts not just whether we covered something, but how many times we traversed an edge or a block, so that we can do more precise attribution of why a fuzzer was not able to cover some code that is gated by an edge. We also fold in LLVM-based dynamic data flow analysis: we apply DFSan to the inputs to the test harnesses, marking them as sources, and then propagate that taint forward to figure out whether an edge that we're unable to cover is due to some deficiency in a harness, or due to some problem with the inputs, or, alternatively, whether there is nothing we can say at all.
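A toy model of the workflow just described, with the gate extraction, the cost-benefit ranking, and the taint-based attribution all simplified to illustrate the idea; the data structures and the weighting are my assumptions, not the actual implementation:

```python
def gating_edges(edge_counts, block_counts):
    """Edges never taken whose source block did execute: the 'gates'
    in front of uncovered compartments. Counts come from PGO-style
    profiling of the fuzzing corpus."""
    return {(src, dst): block_counts[src]
            for (src, dst), taken in edge_counts.items()
            if taken == 0 and block_counts.get(src, 0) > 0}

def rank_compartments(gates, compartment_blocks):
    """Cost-benefit heuristic: weight each gate by how often the
    fuzzer banged on it times how much uncovered code lies behind it,
    surfacing the heavy hitters first."""
    return sorted(gates, reverse=True,
                  key=lambda e: gates[e] * compartment_blocks.get(e, 1))

def attribute(gate, taint_source):
    """Map a DFSan-style taint verdict on the gate's branch condition
    to a suggested intervention (the labels here are illustrative)."""
    return {"input": "modify or add seeds",
            "harness": "fix harness or flags"}.get(
        taint_source.get(gate), "no attribution possible")
```

The last case, "no attribution possible", matters because, as noted next, dynamic data flow analysis frequently comes back empty-handed.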
Because dynamic data flow analysis will give you that last response quite a bit, too. So, based on time, I'm going to skip forward a little bit and talk about the experiments that have been carried out to date on this. We again had to construct some datasets to carry this out; these are the particular benchmarks that we looked at this time. Instead of drawing from our own set, we were able to draw from FuzzBench, which in turn is based on a snapshot of OSS-Fuzz, so something that is actually used in practice to find real bugs in real software. These are industry-strength fuzzing harnesses, if you want to think of it that way. We structured our experiment as a traditional real fuzzing campaign might be run. We took a best-of-breed fuzzer, AFL++ in this case, because it consistently performs extremely well on FuzzBench. For the control group, we ran that for 24 hours using the unmodified fuzzing harnesses and seeds. For the experimental group, we ran for 12 hours, then paused, took the incremental coverage at that point, conducted a compartment analysis, collected the top 20 heavy hitters, attributed those using data flow analysis as to why those compartments might be locked, and then had a researcher, playing the role of a fuzzing operator, study that report and come up with interventions. The interventions were applied, and then fuzzing continued for another 12 hours. We repeated this 50 times, so as to operate on median data on both sides for the experimental group, with one set of runs for the control group. And this is what you end up with. I realize this is a little bit difficult to see, but what these are are LLVM code regions, a somewhat more nuanced metric of coverage than simple lines.
So the blue boxes on the left are box plots indicating regions covered by normal fuzzing, and the ones on the right are human-assisted fuzzing, both for 24 hours total. And this is actually a fairly nice result. You can see that it's not a panacea; in some cases there's not necessarily a massive improvement. But across the board, there are a number of cases where there's a lot of daylight between those two boxes. The absolute numbers for the relative improvement are in the boxes below; you can see it ranges from 1 percent, which is not really interesting, all the way up to, I think, 94 percent over there on the right. So there's a range of effects. In fact, all of the benchmarks except for libjpeg turned out to have dramatic coverage increases. In order to leave some time for questions, I'm just going to move ahead here. There are some interesting things; maybe I'll just show this and we can leave it there. This is diving into two of the compartments. On the left, before compartment analysis for libxml2, you can see we identified two compartments, highlighted in red, with gating edges 167 and 391. These were weighted very highly. We came up with interventions; on the left, we basically completely covered that entire compartment, dramatically increasing the fuzzer's coverage on that benchmark. On the right, we had a partial result, which I think is fair to say is something you would see fairly often. It's not as if you just break into the next compartment and get the entire thing. This is more of an iterative approach that you can apply to a fuzzing campaign over time.
But the flip side is that fuzzing campaigns do run for long periods of time, and the incremental benefits that you get with something like this add up, and they're complementary to any benefits from the general fuzzing research that considers only expanding the coverage frontier. I think I'm running up against the end of my time, so I'm going to wrap up here and say: yes, compartment analysis is not a panacea, and there is some dependency on the human expertise involved. On the other hand, I would say that our fuzzer operator was not an expert on JPEG or the other benchmarks here. And some programs are simply hard targets. We still think it's pretty promising. I would love to tell you about some of the replication that we've already had, but I can't talk about the things that go on at those places here. But this stuff does work in a lot of cases, and it's pretty exciting. So I'll conclude. We took a deep dive into bugs today; thanks for coming along with me. We saw the results of a large study of synthetic bugs in comparison to real bugs. We learned why some fuzzers perform better than others in certain cases, and in turn some implications for how we can better measure their performance and report results. We also saw that synthetic bugs are significantly less difficult to find than real bugs, which limits their utility in fuzzing benchmarks for now; in particular, people should stop using LAVA, and if there's one takeaway, maybe that's a good one. But we gained some valuable insights in return, and the mean path phenomenon led to the development of compartment analysis. More broadly, I think there's a lot more to learn about bugs, and there's a lot of value in that effort. It's useful to understand more than just the minimal specification that defines a bug's essence; we should also understand the context of how bugs appear in real programs.
Finally, if you needed convincing, I hope one last thing you take away from this is that there's a lot of value in experimental computer security, and the field can always do with more replication and more science, to hopefully better guide future work towards that elusive goal of taming the bug. With that, I'll conclude, and I'd love to take any questions you have.

Awesome, let's thank our speaker. I was one of the people who read the abstract, and I was not disappointed, if anything. Any questions from the room?

Yeah. It seems to me that the reason a synthetic bug is easier to find than an organic bug is that the way you construct the bug is not deep enough. I just wonder, since we know so much about organic bugs, whether we can make the synthetic bugs more similar in terms of the program constructs or the triggering conditions. It also seems to me that this problem is independent of fuzzing, meaning that if you were using other, more specialized bug-finding techniques, you would reach the same conclusion.

Yeah, absolutely. So I think fuzzing was one context in which to investigate this, but you're right, it is orthogonal in terms of how you might go about remedying the problem. And some of the things I talked about, like building a model of real bugs, or having some metrics that you use to calibrate difficulty, or reusing existing code constructs that you pull from real CVEs, I think those are great ways to go about it. It also gets a little tricky, in that you not only need to replicate the characteristics of real bugs that you think are important; if you think about this as an adversarial game, which is not how you'd like to think about benchmarks, then how noisy a bug is in other respects can also come into play.
So you might care about things that otherwise wouldn't necessarily matter, like whether a data flow looks weird in the context of a particular program. That could also be a problem, even if it were a real bug. So I think there are a lot of considerations here, and it's an intellectually very interesting problem. Does that answer your question? Great.

Thanks. This relates to my earlier question, and I'll need to read the papers you referenced comparing these fuzzers. I think it's correct to say that in security, once we find bugs, that's not the whole story. There are also groups, here and at other places, doing automatic generation of exploits, basically saying, hey, to prove that the bug you found really matters, you should show that it can be exploited. I would love to hear your comment on how important that kind of work is, and how it relates to synthetic bugs. It seems to me that synthetic bugs, because they're generated, are obviously exploitable, otherwise what's the point? Whereas organic bugs may be harder to find, but they may also be inconsequential; absent that proof, they may not be exploitable at all.

That's absolutely the case. So I think there are a couple of pieces here. On the question of research into the exploitability of bugs, I think that is extremely valuable; it feeds so much into triage. We have all of these bugs, and we need to be able to sift through to the ones that matter. So that whole area of research is something I would not criticize at all; I think there's a lot of value there. When it comes to the other side of things, I think... I'm sorry?
For the synthetic bugs, the bugs that you generate automatically, if they are exploitable by construction, that may give you another way to measure the performance of fuzzing, or of any bug-finding technique, because we then know the techniques are capable of finding the consequential kind of bug.

Yeah, I think I see what you're saying. It's sort of an assumption we're just going with that we treat all bugs equally. There's definitely room to pull in exploitability as a way to tune the models that you learn, and to tune the kind of synthetic bugs that you create; certainly those are the ones that we care about. I completely agree there. Actually, it seems like there's a nice opportunity for collaboration and synergy there.

If you don't mind a question from the host: you mentioned a lot about human involvement, and of course a lot of these tools are miles away from fully automated. I'm wondering if you could comment on the pipeline for that sort of human. I mean, that's an in-depth understanding of binary analysis at the PhD level, right? So where are these humans coming from? How do they play a role? How do we get enough of them in the world? Can you comment on that?

You're talking specifically about the fuzzer operator role that we had? Yeah, exactly. Well, I would agree there are not many of those people out there; I am looking for them myself. But I would actually say that our exemplar, and the role that we envision for this, doesn't need to go that deep. We're dealing with source code here, so binary analysis doesn't actually come into it, although it could be useful in other circumstances.
But the way this compartment analysis works, and I really didn't get a chance to show you the reports, is that you get pointed to a point in the code, and the idea is that you can use clues such as a symbol name that suggests something. In the case of file, for example: oh, this uncovered code is dealing with document files, so maybe I should go get a Word doc and just drop it in the seed set. It's at about that level. And I think if you have a basic, undergraduate computer science level of training, that's totally a fair ask. So hopefully that clarifies our expectations for these operators.

Yeah, absolutely. Thank you so much. Other questions from anyone?

Hi. So what parts of the compartment analysis cannot be automated, and what parts can be automated?

Certainly producing the analysis itself is fully automated. There are difficulties that come up in a practical sense with the data flow analysis: DFSan has a lot of misses, a lot of which turned out to be implicit flows that could be handled fairly easily using heuristics. But the entire process of generating the analysis report is automated. It's acting on that report that is more difficult. Though if you wanted to draw from NLP approaches, you could start to think about pulling out tokens from these symbols and connecting that back to this automated seed construction idea: okay, there are a lot of docx tokens in these symbols, so let's just go ask Google for a bunch of docx files. That could be a way to automate that aspect of things; I'd have to think more deeply about it.

Okay, let's thank our speaker.
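As a closing aside, that token-mining idea from the last answer can be sketched in a few lines; the identifier-splitting heuristic and the stop-word list are illustrative, not part of the actual system:

```python
import re
from collections import Counter

# common identifier fragments that carry no format hint (illustrative list)
STOP = {"parse", "read", "get", "set", "init", "free", "new", "the"}

def suggest_seed_tokens(uncovered_symbols, top=5):
    """Split uncovered symbol names on underscores and camelCase,
    count the fragments, and return the most frequent ones; tokens
    like 'docx' or 'jpeg' hint at what seed files to go fetch."""
    tokens = Counter()
    for sym in uncovered_symbols:
        for part in re.findall(r"[A-Z]?[a-z0-9]+", sym):
            token = part.lower()
            if len(token) > 2 and token not in STOP:
                tokens[token] += 1
    return [tok for tok, _ in tokens.most_common(top)]
```

Feeding the top tokens into a file-format search, as suggested in the answer, would close the loop from analysis report to new seeds without the human in the middle.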