[00:00:15]
>> I remember yeah I am easy to remember because. Yourself. At 5 years I was. Like a middle child. Or an. That. Network. And. Where you're supposed to. Track. The joke I've heard is that it's a marker drinking game is every time I say boom filter you're supposed to take a drink but if you don't have anything here.
[00:01:29]
I'm going to give sort of a whirlwind tour of, you know, stuff related to Bloom filters. So I'm going to start at the very beginning, which means that those of you who have seen me give a talk before have probably seen at least the first few slides, but I'll try to get through that reasonably quickly and then get to,
[00:01:47]
Rush to get to some of the fun stuff at the end, which is combining machine learning ideas with Bloom filters for some new types of data structures. All right, so let me start with just a review of Bloom filters, the wonderful data structure that everyone should know but very few do, because for some reason it's not in the CLRS textbook. So for Bloom filters, you're given a set from a universe and you just want to answer queries of the form "is this an element of the set?" Now that should seem like a pretty easy question. Typically you'd stick things into some sort of numerical form, sort them, and then do something like a binary search. But we're looking for something that can provide an answer in constant time and, in particular, in a small amount of space. So the idea is that we're going to actually store something that's less than the size of the original data, so you can think of it as a compressed form meant to answer these questions. And the way we're going to be able to get this small amount of space is we're going to allow some probability of being wrong in giving an answer. When I say some probability of being wrong, I mean in a specific way. You could have either false positives, where something isn't in the set but we say that it is, or false negatives, where something is in the set and we report that it's not.
[00:03:11]
And for Bloom filters we're going to allow some fraction of false positives, but it will not have false negatives. So you're always going to get the correct answer for the things in the set; you might get an incorrect answer for things that are not in the set.
[00:03:28]
And the reason I think Bloom filters are so wonderful and should be in the undergraduate textbook, even though they're not, is you can really explain them pretty easily in one slide. So a Bloom filter starts with an array of m bits that's filled with zeros, and we're going to hash each item in the set some number of times, k. Each time we hash the item, that's going to give us an index location into the array, and we set that location to be a one. The same location might be set to one multiple times; that's fine, if it's already a one it stays a one.
[00:04:02]
And now when I want to do a lookup of an item, I just check those k locations: I hash it with the k different hash functions I have and look at the various locations. If I ever see a 0, that's good: then I know that it cannot be an element of the set, because the way I've set it up, if it's an element of my set I should only see ones.
[00:04:28]
If I see all ones, I'll report that it is an element of the set. If it's an element of the set, all is fine, but it might not be an element of the set; it's just gotten unlucky and managed to hit all ones. All right, so you can calculate the probability of this sort of error.
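A minimal sketch of the one-slide description above, not a tuned implementation. The class name `BloomFilter`, the use of salted SHA-256 to stand in for k independent hash functions, and the byte-per-bit array are all assumptions made for readability.

```python
import hashlib


class BloomFilter:
    """Array of m cells probed by k hash functions (illustrative sketch)."""

    def __init__(self, m, k):
        self.m = m
        self.k = k
        self.bits = bytearray(m)  # one byte per bit: clear, not compact

    def _locations(self, item):
        # Assumption: salting SHA-256 with the hash number i stands in for
        # k independent, perfectly random hash functions.
        for i in range(self.k):
            digest = hashlib.sha256(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.m

    def add(self, item):
        for loc in self._locations(item):
            self.bits[loc] = 1  # setting an existing 1 again is harmless

    def __contains__(self, item):
        # Any 0 proves the item was never added; all 1s means "probably in the set".
        return all(self.bits[loc] for loc in self._locations(item))


bf = BloomFilter(m=8000, k=5)
for i in range(1000):
    bf.add(f"word{i}")
print("word7" in bf)  # True: no false negatives
```

Items that were added always report as present; items that were not added report as present only when all k of their locations happen to have been set by other items.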
[00:04:46]
So the standard or basic analysis works as follows. Look at the probability that each individual bit of the filter is 0 after I do all this hashing. A single hash has to miss that bit, which happens with probability 1 minus 1 over m. I have n items in my set and I'm hashing each of them k times, so I have to miss that specific location kn times.
[00:05:13]
And I often use the exponential form just because it's easier to write and a good approximation for reasonable values of m, n, and k. So I can figure out the probability that a specific bit in the Bloom filter is 0. And of course if, after I do all this hashing, rho is the fraction of 0 bits in the filter, then the false positive probability will just be 1 minus rho, to the k. That is, I have to have hit all ones; the probability each of my hash functions hits a one is 1 minus rho, so the false positive probability is 1 minus rho, to the k. And the idea here is that, given the probability a specific bit is 0, that tells me the expected number of zeros inside the Bloom filter. Everything is going to be nicely concentrated around its expectation, so it's perfectly reasonable to take as an approximation this 1 minus p-prime value, which is about this 1 minus p value, which gives us this fairly nice-looking form for what the false positive probability is for a Bloom filter.
[00:06:19]
In particular, now I have something I can take the derivative of to find what the optimal number of hash functions is, and it has this nice form: it's the natural log of 2 times m over n. Remember, m was the number of bits and n is the number of elements in the set, so m over n you can think of as the number of bits I'm going to use per element in my Bloom filter.
[00:06:45]
Plugging this back in, you can find that the false positive probability falls exponentially in this value m over n, the number of bits per element you're willing to use, with a base that's slightly bigger than one half. That was sort of a whirlwind, and there'll be a few more slides on it, but let me just stop in case there's anyone who hasn't seen Bloom filters before who has any questions, just because I went through so fast.
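For reference, the analysis just described can be written out as follows. The symbols m (bits), n (items), and k (hash functions) are the speaker's; writing p' for the exact zero-bit probability, p for its exponential approximation, and f for the false positive probability is an assumption about the slide notation, which is not visible in the transcript.

```latex
\begin{align*}
p' &= \left(1 - \frac{1}{m}\right)^{kn} \approx e^{-kn/m} = p, \\
f  &\approx (1 - p')^{k} \approx (1 - p)^{k} = \left(1 - e^{-kn/m}\right)^{k}, \\
k^{*} &= \ln 2 \cdot \frac{m}{n}
  \quad\Longrightarrow\quad
  f = \left(\tfrac{1}{2}\right)^{k^{*}} \approx (0.6185)^{m/n}.
\end{align*}
```

At the optimal k the filter is about half zeros (p = 1/2), which is where the base of roughly 0.6185 per bit per item comes from.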
[00:07:15]
OK. All right, so here's an example curve. If we let m over n equal 8, so 8 bits per item, and we treat the number of hash functions as a continuous variable, which is fine for plotting, we see that the optimal number of hash functions is between 5 and 6. So you pick 5 and you get a false positive probability of just over 2 percent. If you wanted less than one percent, you can get that by the time you get to 10 bits per item. And again it falls exponentially, so even with fairly reasonable sizes for big objects,
[00:07:58]
You can get a very compressed form, say 8 or 16 bits per object, and get very small false positive probabilities. Now, the way I've described them, Bloom filters are all just one big array. Sometimes it's kind of nice to split them up, for parallelization or other purposes, so that you have a separate part of the array for each of the hash functions, and you get one location in each of the sub-arrays.
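The figures quoted for the example curve (8 and 10 bits per item) can be checked directly from the approximate formula. This is a small sketch; the function names `false_positive` and `optimal_k` are made up for illustration.

```python
import math


def false_positive(bits_per_item, k):
    """Approximate false positive probability (1 - e^{-k n/m})^k."""
    return (1.0 - math.exp(-k / bits_per_item)) ** k


def optimal_k(bits_per_item):
    """Real-valued optimum k = ln(2) * m/n."""
    return math.log(2) * bits_per_item


print(optimal_k(8))           # between 5 and 6
print(false_positive(8, 5))   # just over 2 percent
print(false_positive(10, 7))  # under 1 percent
```

Since k has to be an integer in practice, one evaluates the formula at the integers on either side of the real-valued optimum and picks whichever suits the application.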
[00:08:30]
And it turns out that this has essentially, or asymptotically, the same performance; the deviations are very small, and it can be easier to parallelize. Now let me contrast the Bloom filter approach with another approach, which gives you something that is very much like a Bloom filter, has the same sort of properties, and will be useful in talking about some of the schemes that we're going to look at later. So if you have all your set items in advance, an alternative to creating the Bloom filter I just described would be to use a technique based on perfect hashing. The idea of perfect hashing is: if you give me n objects, I can find a hash function that maps each object into a distinct location in an array of size n.
[00:09:22]
So I have n objects; they're mapped to n places according to a perfect hash function. It may take some time to find that perfect hash function, but once I've found it, assuming my set's not going to change, that work is done and I have a hash function that I can deal with.
[00:09:39]
What I can then do is store a fingerprint of each element inside that array. So now when I want to do the sort of Bloom filter operation, "is this an element of my set?", I take the element, use the perfect hash function to find a cell, and then I hash the element to obtain a fingerprint and compare the fingerprints. If the fingerprints match, then I say it's an element of the set, and if they don't, then I know that it's not. As for the size of the perfect hash function, it can depend on your setting. Typically, for the things you'd apply this to, you can get something that's like little-o of n, but it can still be a polynomial amount of space. So in some sense I may be cheating, in that I am not going to count the amount of space
[00:10:32]
For the perfect hash function, but that may be reasonable because it may be sublinear, so asymptotically it doesn't matter. OK. So this alternative construction teaches us that Bloom filters are not optimal, at least in terms of this space versus error tradeoff. And the reason is that if we're using these perfect hash functions, each time I add an additional bit per element I am reducing the false positive probability by a factor of 2.
[00:11:08]
Each bit that I have in the fingerprint is going to reduce the false positive probability by exactly a factor of 2, not that 0.6185 that I had before for the regular Bloom filter. So if I want a false positive probability of less than epsilon, I would just store a fingerprint of log base 2 of one over epsilon bits at each cell. Again, here and elsewhere I'm going to assume that I have nicely behaved, perfectly random hash functions, which is a reasonable model for most situations that we're going to think about.
[00:11:41]
Again, so the false positive probability for this is now 0.5 to the m over n, so log of one over epsilon bits suffice, and this will use about 40 percent less space than the Bloom filter I've described. But in many ways it's a less flexible, less desirable solution. In part, I'm not counting the bits for the perfect hash function or the computation for finding the perfect hash function. Also, unlike with a regular Bloom filter, you can't add elements. If I wanted to add an element to the Bloom filter, I would just put some more ones inside the array; that would change the false positive probability, but it wouldn't change the functionality. Here, if I try to add a new element, I have to redesign a whole new perfect hash function from scratch.
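A toy sketch of the fingerprint-plus-perfect-hash construction, under loud assumptions: the perfect hash function is found by brute-force search over seeds, which only works for a handful of items, and the helper names `h`, `build`, and `lookup` are hypothetical. Real perfect hashing schemes are far more efficient than this search.

```python
import hashlib


def h(seed, item, mod):
    # Assumption: salted SHA-256 stands in for a perfectly random hash function.
    digest = hashlib.sha256(f"{seed}:{item}".encode()).digest()
    return int.from_bytes(digest[:8], "big") % mod


def build(items, fp_bits):
    n = len(items)
    seed = 0
    # Brute-force search for a seed that maps the n items to n distinct cells.
    while len({h(seed, x, n) for x in items}) < n:
        seed += 1
    table = [0] * n
    for x in items:
        table[h(seed, x, n)] = h("fp", x, 1 << fp_bits)  # store only the fingerprint
    return seed, table


def lookup(seed, table, fp_bits, item):
    # Find the item's cell, then compare fingerprints.
    return table[h(seed, item, len(table))] == h("fp", item, 1 << fp_bits)


items = ["apple", "banana", "cherry", "date", "elder", "fig"]
seed, table = build(items, fp_bits=8)
print(all(lookup(seed, table, 8, x) for x in items))  # True
```

With 8-bit fingerprints a non-member matches with probability about 1/256, and each extra fingerprint bit halves that. Adding a seventh item would mean rerunning `build` from scratch, which is the inflexibility described above.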
[00:12:31]
All right, so why are Bloom filters so much fun, or why do I think they should be taught everywhere? It's because in the real world there's an interesting tradeoff space in your data structures. I like teaching this to undergrads because undergrads at some point get used to the idea that there's a time-space tradeoff, but, at least when I teach the undergrad algorithms class, the thing that floors them is the idea that correctness is a tradeoff parameter: that you might want to allow some form of error to get improvements in time and space. And I think the Bloom filter data structure is the easiest, most natural, or gentlest example of that.
[00:13:13]
So here are three of the dimensions in the real-world tradeoff space. Here's the fourth one, which you would never even talk about in an undergraduate algorithms class: programmer time, the idea of simplicity. It even surprises me how many people use Bloom filters, and I think it's because they're so simple. I don't mean this in a disparaging way: they're so simple even, like, systems people understand them.
[00:13:46]
I mean that, you know, you can understand that systems people, when they're building something, actually want to understand the parts and the pieces in case something goes wrong or they need to change something. So you can provide them some fancy highfalutin data structure and they're going to be like, "Yeah, I don't want to use that, because if something goes wrong or it's not working the way I think it is, I'm never going to be able to fix it or understand it." Whereas with Bloom filters, systems people go, "Yeah, that's totally reasonable, I get that."
[00:14:22]
Yes. Yes, certainly, yeah. And yeah I go over that a 100. With the perfect hash function you're putting some of that into offline time. It may be expensive, but yes, it's an offline cost, in that it's not going to be the dramatic cost when you're actually running the operation. All right, so again I'll try and speed through the old stuff. Classic uses of Bloom filters were for things like spell checking. It's hard to believe, but once upon a time memory was scarce enough that you might not think about keeping the whole dictionary, because dictionaries were big, so you might use a Bloom filter. Now false positives would mean that you'd actually accept a misspelled word, so you wouldn't want a high false positive rate, but maybe that's OK for certain uses of your dictionary. And Bloom filters have historically been used for these sort of dictionary-type things, for things like password security,
[00:15:24]
Keyword-driven ads. In databases there's a version of the join operation that uses a Bloom filter. The idea is, I have some sort of query, like "all the records on X, Y, and Z, please," and if it's a big set of things that you're interested in, so the X, Y, Z set is rather big, then instead of actually sending "here's a list of all the things I want information on," you can send a Bloom filter of that. Then the other side of this distributed database will check their records, figure out what the matches are, and send the corresponding records back. The idea is that if you have a small enough false positive rate, you may be better off sending the list as a Bloom filter rather than trying to send the actual list of things you want to match against.
[00:16:19]
Signature detection. So Bloom filters were for a while, I think, popular in security. This is when you're just looking for actual signatures, like specific strings within a file. Some of the early solutions that were operating at line speed were using families of Bloom filters, a Bloom filter for each length. So you would take your list of strings that you wanted to watch for and divide them up by length; as each new character came in, you could hash the updated strings for each of the lengths and look for signatures.
[00:16:56]
And so here what a false positive would give you is: you think you have a match for something that might be dangerous, that you have to look at more closely, when you actually didn't need to look at it more closely. So you want to keep the false positive rate down, because you want to keep as few unnecessary things on the slow path as possible.
[00:17:19]
And even since the earlier times... Andrei Broder and I wrote a survey back in, like, '99 or 2000, and there we were coming up with "here are places where Bloom filters are being used," and since then they're now in, like, thousands of papers, so we can never actually update the survey because there'd be too many things to write about.
[00:17:43]
But they seem to be getting used all over the place. All right, so the main point for Bloom filters is that whenever you have a set or a list and space is an issue, a Bloom filter may be a great alternative. The one thing you always have to keep in mind is that it's going to give you false positives: what are the effects of those false positives, and how can you handle them in your system?
[00:18:09]
All right. So a couple more quick slides and then we'll move on to the next thing. Deletions: you can add items into a Bloom filter very easily, just add a few ones. Deleting is a bit harder, because, as you can see from this example, if we delete some item x_i by changing its ones back to zeros, we might actually be affecting other items and taking them out of the Bloom filter when we don't want to.
[00:18:36]
So one of the ways to deal with this is, instead of having each of these cells be a bit, we turn them into counters, so then we're keeping track of how many things hash to each location. Now when we delete something we can decrement those counts. And if we want the corresponding Bloom filter, zeros go to zeros, nonzeros go to ones, and we have the corresponding Bloom filter.
[00:19:03]
With counters we have to worry about overflow. I'll just flash this slide up and skip it, and say that if you use 4-bit counters, in practice you won't have to worry too much about overflow; just by probability arguments you can show that it's very unlikely that 16 things end up hashing to the same place.
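A minimal sketch of a counting Bloom filter with 4-bit counters. The class name and the overflow policy (saturate at 15 and never decrement a saturated counter) are assumptions; the talk only says that overflow is unlikely, not how to handle it.

```python
import hashlib


class CountingBloomFilter:
    def __init__(self, m, k, counter_bits=4):
        self.m, self.k = m, k
        self.max_count = (1 << counter_bits) - 1  # 15 for 4-bit counters
        self.counts = [0] * m

    def _locations(self, item):
        for i in range(self.k):
            digest = hashlib.sha256(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.m

    def add(self, item):
        for loc in self._locations(item):
            if self.counts[loc] < self.max_count:  # saturate instead of wrapping
                self.counts[loc] += 1

    def remove(self, item):
        # Only meaningful for items that were actually added earlier.
        for loc in self._locations(item):
            if 0 < self.counts[loc] < self.max_count:  # saturated counters stay stuck
                self.counts[loc] -= 1

    def to_bloom_bits(self):
        # Zeros go to zeros, nonzeros go to ones.
        return [1 if c else 0 for c in self.counts]

    def __contains__(self, item):
        return all(self.counts[loc] for loc in self._locations(item))


cbf = CountingBloomFilter(m=1000, k=3)
cbf.add("a")
cbf.add("b")
cbf.remove("a")
print("b" in cbf)  # True: deleting "a" does not disturb "b"
```

`to_bloom_bits` gives the plain bit-array view that a lookup path would actually consult.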
[00:19:24]
And the way this can be used in practice is, as long as insertions and deletions are relatively rare compared to lookups, you can keep the bigger, more expensive counting Bloom filter in a slow memory and keep the actual Bloom filter in on-chip memory. You go back to the counting Bloom filter whenever you get an insertion or deletion, and you just update the Bloom filter when that actually changes the Bloom filter part.
[00:19:54]
All right. So now I'm going to switch gears and talk about something completely different, but it will come back to Bloom filters shortly, because everything does. So let me just get a show of hands: how many people are totally familiar with, and are going to be bored by, cuckoo hashing? Not as many as I thought, but it was the right people who I thought.
[00:20:22]
All right, so cuckoo hashing. OK, so the simple version of cuckoo hashing. The idea of cuckoo hashing is that we're going to use the power of choices. So instead of just hashing something once and putting it in that location, we're going to give each item two possible choices. That is, we'll have two hash functions, and the item will have to go in one of those two places.
[00:20:49]
I'll have a picture of this in a second, but just to explain what this looks like: to insert an item, I can check both its locations. If either of them is empty, I can put it in one of those two locations and everything's fine. If both those locations already have items in them, then the new item that I'm trying to put in will pick one of those locations and kick out the element that is there and say, "Sorry, you have to go somewhere else." Now luckily that item has another possible place to go, and if it's empty then everything's all well and good. But of course it might find that there's something else in its place and have to kick it out, and so on, until we get to an empty slot. OK, so let's see what this looks like. So here's a cuckoo hash table. I'm using the split version, so the two hash functions correspond to sub-pieces, and what these arrows represent is the other possible location for an item.
[00:21:50]
So a new item F might come in. F is going to cause me no problems: both of its cells are empty, so I just put it in one of them. OK, this item G comes in, and unfortunately both of its cells are full, so it's going to have to kick one of the two out. Let's say it kicks out element A; A will in turn kick out element E; E will go to that empty spot. Everything will be wonderful; everyone has a home.
[00:22:20]
All right, I always like to ask the audience: what's going to happen here? Yeah, we have a cycle. We try to put in G and it's going to end up cycling through and going on forever, and we can't actually place G successfully. One way to see this pictorially, because it's a small example, is you can see I'm trying to put 7 things into 6 places, and that's not going to work. Mathematicians call that the pigeonhole principle; all the rest of us call it common sense. You can't put 7 things in 6 slots. There you go.
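A minimal sketch of the split, two-choice cuckoo table walked through above. The class name, the `max_kicks` cutoff used to detect a probable cycle, and returning the homeless item on failure are all assumptions made for illustration.

```python
import hashlib


class CuckooHashTable:
    def __init__(self, size, max_kicks=100):
        self.size = size
        self.max_kicks = max_kicks
        # Split version: one sub-table per hash function.
        self.slots = [[None] * size, [None] * size]

    def _h(self, which, key):
        digest = hashlib.sha256(f"{which}:{key}".encode()).digest()
        return int.from_bytes(digest[:8], "big") % self.size

    def lookup(self, key):
        # Constant time: the key can only be in one of two places.
        return any(self.slots[t][self._h(t, key)] == key for t in (0, 1))

    def insert(self, key):
        """Return None on success, or the item left without a home."""
        for t in (0, 1):
            i = self._h(t, key)
            if self.slots[t][i] is None:
                self.slots[t][i] = key
                return None
        t = 0
        for _ in range(self.max_kicks):
            i = self._h(t, key)
            # Place the carried key and pick up whatever was living there.
            key, self.slots[t][i] = self.slots[t][i], key
            if key is None:
                return None
            t = 1 - t  # the evicted item must try its other sub-table
        return key  # probably a cycle: gave up after max_kicks evictions


table = CuckooHashTable(size=3)  # 6 slots in total
leftovers = [table.insert(f"item{i}") for i in range(7)]
print(any(x is not None for x in leftovers))  # True: 7 items cannot fit in 6 slots
```

The final lines reproduce the pigeonhole situation: with 7 items and 6 slots, at least one insertion has to end with an item that has nowhere to go.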
[00:22:59]
All right, so the question is: that's a bad case; when are those bad cases going to happen? OK, so to start off with, what are some of the good properties of cuckoo hashing? You're going to have constant lookup time, because an item is in one of two locations; I'm just going to look at those two locations to find the item. As we'll discuss, you can get high memory utilization. And another plus is that, like many things in hashing, it's pretty simple to build and design.
[00:23:32]
But we have to worry about these failure cases. For mathematical reasons we might consider it a failure if an insertion just runs into a very long path, because we can't tell if it's ever going to complete or not. But also the bad case we were worried about there in the beginning was that we end up running into these cycles, where we're trying to put too many elements into too few slots.
[00:23:59]
So as I'll discuss, the bad cases occur with very small probability when the load is sufficiently low. And in theory at least, if the bad case happens, you could always say, "Well, I'll just pick new hash functions and rehash everything," and that happens so rarely that I'm not going to worry about it.
[00:24:19]
Again, that's a nice idea in theory. Practitioners may look a little worried, like, "I'm going to stop whatever I'm doing to rehash everything, to rebuild the hash table?" So we'll talk a bit about how we might deal with that as well. So, the basic performance: if I give each item two choices
[00:24:44]
And the load is less than 50 percent, then the failure rate is going to fall like 1 over n. So I'm going to have maybe a half-empty table, half the spots will have to be empty, but I always know where things are, up to these two locations.
[00:25:03]
We don't have to do any of the hash table bucketing with pointers or anything like that; it's all going to be pointer-free. And moreover, the maximum insert time is going to be order log n. And at least for the two-choice version, how you prove all these things is that this all just boils down to looking at this as a random graph.
[00:25:27]
So each element has two choices, and you view those two choices as being vertices on a graph. Each bucket is going to correspond to a vertex, and now each element is an edge. And what you're asking is: is there a way to orient these edges? Each edge is going to get associated with one of its two vertices; that's called an orientation. Is there a way to orient the edges so that they each get their own vertex?
[00:25:59]
So in terms of what that means in graph terms, what you would want is that each of the components of this random graph is either a tree, with no cycles, or has at most one cycle. If you have just one cycle, you can still put each item into its own,
[00:26:17]
Each edge into its own vertex. And the theory of random graphs tells us that when you have a small enough number of edges, you do indeed get small components. So you can extend this to... you can have multiple disjoint cycles, you can have, you know, a cyclic component.
[00:26:45]
Not just... right, you can have multiple components that are cycles or, you know, cyclic with other stuff. OK. There are all sorts of extensions. What if you have more than two choices per element? Then you're dealing with hypergraphs instead of graphs, so now you have to work with hypergraph theory instead of graph theory.
[00:27:09]
Or more than one element per bucket, right? So now you're looking at orientations where, instead of just having one thing, you're having multiple things that can point to a bucket. These were variations that were looked at very early on. And here are the sort of results you get. If you have a bucket size of one and you have two choices, you can get loads up to one half. But there's a real jump as soon as you go to 3 choices: you can get loads of over 90 percent, and then over 97 percent when you go to 4 choices. Similarly, if you stay at 2 choices
[00:27:50]
But have bigger bucket sizes, that flexibility can lead to higher loads. In a common version that we look at, you can of course combine these things: you can have multiple choices with buckets of size bigger than one also. All right, so we're talking about failures in cuckoo hashing, which occur when an item can't be placed. Well, you're going around your cuckoo hash table and you can't find a place to put this item. Should you really just give up and call it a day and rehash everything?
[00:28:29]
So another idea is that we say, OK, once in a while we might find something we can't place. Let's just put it in its own special stash. If the stash is not empty, then whenever I do a lookup I also have to check the stash and see if it's there. But if I can keep the stash to, like, a constant size, that's not going to be such a big deal, particularly if most of the time I'm not going to end up using it anyway.
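A self-contained sketch of the stash idea, assuming a small fixed-capacity list next to a two-table cuckoo layout. The class name, `stash_limit`, and the always-start-in-table-0 insertion order are illustrative choices, not details taken from the talk.

```python
import hashlib


class CuckooWithStash:
    def __init__(self, size, max_kicks=50, stash_limit=4):
        self.size = size
        self.max_kicks = max_kicks
        self.stash_limit = stash_limit
        self.slots = [[None] * size, [None] * size]
        self.stash = []

    def _h(self, t, key):
        digest = hashlib.sha256(f"{t}:{key}".encode()).digest()
        return int.from_bytes(digest[:8], "big") % self.size

    def lookup(self, key):
        in_table = any(self.slots[t][self._h(t, key)] == key for t in (0, 1))
        # The stash adds one extra, constant-size check per lookup.
        return in_table or key in self.stash

    def insert(self, key):
        t = 0
        for _ in range(self.max_kicks):
            i = self._h(t, key)
            key, self.slots[t][i] = self.slots[t][i], key  # place, pick up evictee
            if key is None:
                return
            t = 1 - t
        if len(self.stash) >= self.stash_limit:
            # The carried item is dropped here; a real table would rehash instead.
            raise RuntimeError("stash full; a real table would rehash here")
        self.stash.append(key)  # park the leftover item instead of rehashing


table = CuckooWithStash(size=3, stash_limit=8)  # 6 slots plus a stash
for i in range(7):
    table.insert(f"item{i}")
print(len(table.stash) >= 1)  # True: the seventh item has to overflow somewhere
print(all(table.lookup(f"item{i}") for i in range(7)))  # True
```

Forcing 7 items into 6 slots guarantees the stash gets used here; at a reasonable load it would stay empty almost all of the time.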
[00:29:02]
Well, how do you change its two hash functions without changing everybody's hash functions? So it's a special hash, which is kind of equivalent to something like this. All right. So here's a simple and smallish example, again just using two choices, that says, well, how do the stash sizes go? So this was done for
[00:29:39]
Like 10,000,000 trials. And, you know, 99 percent plus of the time I'm never going to need a stash. But when I do need a stash, I don't need a very big one; it sort of falls geometrically. And you can actually prove this: with a stash, the failure probability will fall, instead of like 1 over n, like 1 over n to the power of the stash size,
[00:30:17]
Or the stash size plus one. So you can reduce the failure probability significantly, from 1 over n to 1 over n to whatever power you like, by using even a small stash. All right, so now that we've seen cuckoo hashing, we're going to talk about how to get to Bloom filters via hash tables: can we use this sort of cuckoo hashing to get
[00:30:49]
A better type of Bloom filter? And the motivation for this is: remember we talked about how perfect hash functions can give you a Bloom filter, but they're not very flexible and you have to go through all this work of figuring out a perfect hash function for your data set. So the idea is that we're going to replace our perfect hash function with a near-perfect hash function, the near-perfect hash function being, for instance, a cuckoo hash table. So instead of using this perfect hashing, or perfect hash table, we're going to use a cuckoo hash table in its place.
[00:31:27]
So the idea is, again, I'll have a cuckoo hash table and I'll store a fingerprint in the hash table corresponding to the item. One of the nice things is that this actually can support insertion and deletion of keys, so you don't have to worry about the deletion problem anymore.
[00:31:45]
And it can be very space efficient when you're using cuckoo hash tables with, for instance, buckets that hold multiple keys, because they get very high utilization rates. OK, so a typical configuration would say: let's use a cuckoo hash table with 2 choices per key and 4 fingerprints within a bucket. So there may be 4 different keys within a bucket, with 4 corresponding fingerprints. You just hash an item, look in the bucket, and see if you see its fingerprint in there; if you do, you say it's a positive. Now an issue is that if you're inserting and deleting items as you go, buckets will fill and an item may have to be moved. So how do we know where to move an item? If we have all the items, if we have the set in storage somewhere, maybe we can go find the item and rehash it and decide where it moves. But typically we might use a Bloom filter in a setting where we're not even keeping the items around anymore once we've stuck them in the Bloom filter, so we don't have the key anymore, just the fingerprint.
[00:32:56]
So it turns out a nice systems sort of hack for doing this is to make your two choices not fully independent hash choices, but to have the second depend on the fingerprint. So of your two choices, one is going to be the hash of the item; the second will be the original hash XORed with a hash of the fingerprint. And now the idea is that when you need to move a fingerprint, that's perfectly fine: you can just XOR in the hash of the fingerprint, and that will give you the other choice, in both directions.
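A small sketch of the XOR trick, assuming a power-of-two number of buckets so that XOR keeps indices in range. The constants and the function names `fingerprint`, `first_bucket`, and `alt_bucket` are hypothetical.

```python
import hashlib

NUM_BUCKETS = 1 << 16  # assumption: power of two, so XOR stays inside the table
FP_BITS = 8


def _hash(data: bytes) -> int:
    return int.from_bytes(hashlib.sha256(data).digest()[:8], "big")


def fingerprint(item: str) -> int:
    return _hash(b"fp:" + item.encode()) % (1 << FP_BITS)


def first_bucket(item: str) -> int:
    return _hash(b"idx:" + item.encode()) % NUM_BUCKETS


def alt_bucket(bucket: int, fp: int) -> int:
    # Depends only on the current bucket and the stored fingerprint,
    # so a fingerprint can be moved without knowing the original key.
    return (bucket ^ _hash(b"fph:" + fp.to_bytes(4, "big"))) % NUM_BUCKETS


fp = fingerprint("example-key")
b1 = first_bucket("example-key")
b2 = alt_bucket(b1, fp)
print(alt_bucket(b2, fp) == b1)  # True: XORing twice gets you back
```

Because XOR with a fixed value is its own inverse, the same function maps the first choice to the second and the second back to the first, which is the "both directions" property.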
[00:33:33]
So, you know, this is fine. A possible worry is that in the analysis for cuckoo hashing you assume things are completely random. Now they're not completely random anymore, because you're doing things based on the fingerprint, and the fingerprint is small compared to the size of the hash table. So will this still work?
[00:33:57]
And this was one of the places I came in, looking at this: does it work? In practice, it works great. And what I found was, in theory, it just fails entirely, which is maybe not the best place to be when you're the theoretician trying to help out on the systems paper. But
[00:34:17]
Right, in theory you need logarithmic-size fingerprints. But what you can show for the theory is that the fingerprint size you need is log of n divided by, like, 2 times the bucket size. Our bucket size here was 4, so in theory the size of fingerprint you need for things to actually work out nicely is like log n over 8, and log n over 8 is a constant for all practical purposes. So the theory is OK and it all works out in the end. In theory at least, if you're doing things super big, you might run into problems, but you're safe up into the multi-billions without running into any issues.
[00:35:01]
And there's been some further work by David Eppstein on this sort of idea, which shows that you can actually simplify this process to get the same sort of performance with a better theoretical result. If h1 of x equals h1 of y, so x and y are two different elements, that just means that they each have a choice of a particular bucket. They're sharing a bucket choice, but the bucket's going to hold up to 4 items, so that's not a problem.
[00:35:49]
For further engineering, there are some fun bit-saving tricks. If you have a bucket of size 4, I haven't said what order we keep the fingerprints in, so you can sort the fingerprints and take the 4 most significant bits of each, and those 16 bits can be compressed to 12 once they're sorted, and that saves you a bit per item.
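The counting argument behind this trick can be checked directly: four sorted 4-bit values are a multiset of size 4 drawn from 16 symbols, and there are few enough of those to index with 12 bits. The lookup-table encoding below is one illustrative way to realize it, not necessarily how a production filter does it.

```python
from itertools import combinations_with_replacement
from math import ceil, comb, log2

# Every possible sorted bucket of four 4-bit prefixes.
sorted_buckets = list(combinations_with_replacement(range(16), 4))
code_of = {bucket: code for code, bucket in enumerate(sorted_buckets)}

print(len(sorted_buckets))              # 3876 distinct sorted buckets
print(ceil(log2(len(sorted_buckets))))  # 12 bits suffice instead of 16


def encode(prefixes):
    """Map four 4-bit prefixes (in any order) to a code below 2**12."""
    return code_of[tuple(sorted(prefixes))]


def decode(code):
    """Recover the four prefixes in sorted order."""
    return sorted_buckets[code]


print(decode(encode([9, 3, 3, 15])))  # (3, 3, 9, 15)
```

Saving 4 bits across a bucket of 4 fingerprints is the one bit per item mentioned above.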
[00:36:12]
Which actually turns out to be useful. So here's a picture of performance. The chart's sort of funny: you want to read it right to left, because you're thinking that if you want a smaller false positive probability, it takes a bigger filter.
[00:36:31]
Right, so the point is: here's the lower bound and here's the cuckoo filter, and it has almost the same slope, while the regular Bloom filter's slope is worse. In particular, that means that once you get to a false positive rate of about a couple of percent, you're better off using the cuckoo filter. And this is that one-bit gap you just got from
[00:37:00]
Using that sorting trick. So now, that was all oldish stuff. I mean, the cuckoo filter stuff isn't that old; that's like 2000... like 5 years old maybe. Now I'm going to get into some newer stuff. And I just have time for... The first stuff I'll go through actually fairly quickly: it's adding this notion of adaptivity.
[00:37:27]
So with the Bloom filter, we've been talking about the false positive probability. And again, thinking from the systems perspective, you might be thinking more of a false positive rate: I don't care about the probability for an individual item; I've built my Bloom filter, I'm going to have a collection of queries I'm going to ask of it, and I'm going to ask what the rate is. Now the reason you wouldn't distinguish those as a theorist is that the false positive probability and the false positive rate would be the same if you had distinct items, up to deviations from coin-flip sort of probability things.
[00:38:08]
But unfortunately, in many real situations you can't count on your queries being distinct. If you're working on anything in networking particularly, if you're looking at flows, you're going to see the same flow tuple many times in the course of that flow. So if a false positive happens, you don't really want to keep hitting that same false positive, because that false positive typically means you're doing extra work to check to see what the issue is.
[00:38:41]
So we want to look at the false positive rate over a data stream, taking into account this issue that there might be duplicates within the data stream. So we want to reduce or remove these duplicated false positives for repeated queries. Now, in order to do this, we are going to need to move things around or fix things up when things go wrong, so that means we really do have to actually have the original data around in this cuckoo setting if we're going to have adaptivity. And there's been further work on adaptivity which shows that if you don't keep the original data around, there are limits to what you can do in terms of adapting.
[00:39:28]
So I'm going to assume that we have the original data around. As long as we have the original data around, we'll keep it in a cuckoo hash table that mirrors the filter, so that locations match up and we know where everything is, in terms of what original items map to what locations.
[00:39:46]
And we can keep that in slower memory. And the idea is that whenever we see a false positive, we're going to... in fact, how would you know that you got a false positive instead of a real positive? Right, you have to check against the original data; that's why you need to keep it around.
[00:40:04]
And in order to be adaptive, we're going to allow these fingerprints to change by using different hash functions for the fingerprints. So here's a sort of idea in the case of buckets of size one: we'll just throw some extra bits in, right? So we'll have our fingerprints, and we'll also store with each one 2 extra bits that say which hash function was used to generate that fingerprint, so we can have up to 4 different hash functions.
[00:40:34]
And so whenever we get a false positive, right, whenever we get a positive, we check if it's a false positive or a true positive. If it's a false positive, we change the hash function associated with that element to give it a new fingerprint, which will hopefully get rid of the false positive and not introduce any new false positives.
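The selector-bit idea described above can be sketched in a few lines of Python. This is a toy illustration, not the implementation from the talk: it assumes one candidate slot per key and no cuckoo eviction, and the names `AdaptiveFilter`, `maybe_contains`, and `lookup` are made up for the example. The `fast` list plays the role of the small filter, and the `slow` list plays the role of the mirrored table of original keys.

```python
import hashlib


def _hash(salt: str, key: str, mod: int) -> int:
    """Deterministic salted hash of `key` into range(mod)."""
    digest = hashlib.blake2b(f"{salt}:{key}".encode(), digest_size=8).digest()
    return int.from_bytes(digest, "big") % mod


class AdaptiveFilter:
    """Toy adaptive filter with buckets of size one (hypothetical sketch)."""

    def __init__(self, n_slots: int = 16, fp_bits: int = 8):
        self.n_slots = n_slots
        self.fp_range = 1 << fp_bits
        self.fast = [None] * n_slots  # (fingerprint, 2-bit selector) per slot
        self.slow = [None] * n_slots  # original keys, mirrored slot for slot

    def _slot(self, key: str) -> int:
        return _hash("slot", key, self.n_slots)

    def _fp(self, key: str, selector: int) -> int:
        # The selector picks one of 4 fingerprint hash functions.
        return _hash(f"fp{selector}", key, self.fp_range)

    def insert(self, key: str) -> bool:
        i = self._slot(key)
        if self.slow[i] not in (None, key):
            return False  # simplification: a real cuckoo table would relocate
        self.slow[i] = key
        self.fast[i] = (self._fp(key, 0), 0)
        return True

    def maybe_contains(self, key: str) -> bool:
        """Fast-memory check only; may return a false positive."""
        entry = self.fast[self._slot(key)]
        return entry is not None and entry[0] == self._fp(key, entry[1])

    def lookup(self, key: str) -> bool:
        """Verify a positive against slow memory and adapt on a false positive."""
        if not self.maybe_contains(key):
            return False
        i = self._slot(key)
        stored = self.slow[i]
        if stored == key:
            return True
        # False positive: re-fingerprint the stored key with the next hash function.
        selector = (self.fast[i][1] + 1) % 4
        self.fast[i] = (self._fp(stored, selector), selector)
        return False
```

After `lookup` detects a false positive, a repeat of the same query will usually miss in fast memory, because the stored key now carries a fingerprint from a different hash function. It can still collide again with probability about one over the fingerprint range.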
[00:40:58]
Because I want to get to some of the new stuff, I'm going to skip some of the stuff about, you know, how we can do a Markov chain analysis of this sort of system to figure out the false positive probability. We can use the same idea even beyond buckets of size one. Here we can use the idea if we have, you know, buckets of size, say, four. Right, before, I used the fact that I have four fingerprints and I could do things with their order: I could sort them to save a bit, right? Here I'm going to do something else. I'll just use a different fingerprint hash function for each location, so now if I find a false positive, I'll just swap 2 of the entries in the bucket, and hopefully that will resolve the false positive.
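The swapping variant for buckets of size four can be sketched the same way. Again this is a hypothetical illustration under simplifying assumptions: it models a single bucket only, the swap partner is simply the next cell, and `AdaptiveBucket` and `cell_fp` are names invented for the example. The point it shows is that no selector bits are stored, because the cell index itself determines which fingerprint hash function applies.

```python
import hashlib

CELLS = 4  # bucket size used in the talk's example


def cell_fp(key: str, cell: int, bits: int = 8) -> int:
    """Fingerprint of `key` under the hash function tied to cell index `cell`."""
    digest = hashlib.blake2b(f"cell{cell}:{key}".encode(), digest_size=8).digest()
    return int.from_bytes(digest, "big") % (1 << bits)


class AdaptiveBucket:
    """One bucket of four cells; each cell position has its own fingerprint hash."""

    def __init__(self):
        self.keys = [None] * CELLS  # slow memory: original keys
        self.fps = [None] * CELLS   # fast memory: fingerprints only

    def _place(self, cell: int, key) -> None:
        self.keys[cell] = key
        self.fps[cell] = None if key is None else cell_fp(key, cell)

    def insert(self, key: str) -> bool:
        for cell in range(CELLS):
            if self.keys[cell] is None:
                self._place(cell, key)
                return True
        return False  # bucket full; a real cuckoo filter would evict

    def _hits(self, key: str):
        return [c for c in range(CELLS)
                if self.fps[c] is not None and self.fps[c] == cell_fp(key, c)]

    def maybe_contains(self, key: str) -> bool:
        return bool(self._hits(key))

    def lookup(self, key: str) -> bool:
        hits = self._hits(key)
        if any(self.keys[c] == key for c in hits):
            return True
        for c in hits:
            # Every hit is a false positive: swap this cell with its neighbour,
            # which re-fingerprints both keys under different hash functions.
            other = (c + 1) % CELLS
            a, b = self.keys[c], self.keys[other]
            self._place(c, b)
            self._place(other, a)
        return False
```

The swap changes the fingerprints of two stored keys at once, so it usually removes the offending match, at the small risk of creating a new collision for some other query.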
[00:41:45]
And that turns out to work pretty effectively too. These aren't that useful, so I'm just going to go to the summary and say, you know, that these adaptivity tricks work pretty well. Our theoretical analysis is accurate. So one thing that comes up is that these tricks work even better when you have these skewed data streams. Think about it: if you have a system and you're keeping track of flows, if all the flows are the same size, you know, this will work just fine. It actually works even better when you have heavy-tailed distributions in terms of the number of packets per flow, because in those cases where a false positive occurs on something that's very, very heavy, if you can get rid of that, then you've saved yourself a lot of trouble, right? And so it turns out that for the real-world, high-variance distributions that you see, this adaptivity is even more important.
[00:42:59]
All right, so I rushed that part because I wanted, while I still have 10 minutes left, to get to the really new stuff, which is combining learning theory with Bloom filters. OK. This came out of a paper from Google Brain, and Suresh [inaudible]
[00:43:24]
sort of told me I had to go read this paper and figure out what was wrong with it, so I sort of set off to try to do that. Because they were coming out and saying, not just for Bloom filters but for other hash-based data structures, that we can revolutionize hash-based data structures by applying learning, right? And
[00:43:47]
since hash-based data structures have been my bread and butter for a couple of decades, I find this very threatening. So I felt I had to look at this paper, right? And the idea was that they were saying, let's do something better than standard Bloom filters in a data-dependent way, with the idea that someone gives you a set and you say, well, I can learn the set, right? I mean, think of
[00:44:13]
an extreme case: if your set is just a continuous interval of numbers, right, if it's just an interval, it's very easy to learn or represent the set. It's like, well, if it's between 100 and 2000 say yes, and otherwise that's a no, right? You could do a lot better than a Bloom filter if your data set was structured in some way, right? And they said, well, we can learn what the structure is and come up with better Bloom filters.
[00:44:40]
So the idea is you use machine learning to develop an oracle that sort of gives you a prediction, gives you a probability that an item is in the set or not, and that oracle should hopefully give you few false positives. But you also need to have some sort of backup if you really want to avoid false negatives.
[00:45:00]
OK, so, you know, in a sort of block diagram form, right, this was the structure they proposed. You get your inputs and you have your oracle. If your oracle says yep, that's in the set, you say yes, that's in the set, and there may be some false positives there. And if it says no, it's not in the set, you say OK, it's probably not in the set, but maybe you've made a mistake on some of the things in the set, so I'm going to have a backup Bloom filter here. OK. And the reason why this can still work, or be smaller than the original Bloom filter, is that you just need your backup Bloom filter for the false negatives from this oracle, right? So this Bloom filter can be smaller than the original Bloom filter. And again, this might also introduce false positives here as well, and negatives get thrown out, but you have something that acts like or behaves like a Bloom filter.
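The block diagram described above can be sketched as follows. This is a minimal illustration rather than the structure from the paper being discussed: the "oracle" here is a hand-written interval test standing in for a trained model, and the class names, the `threshold` parameter, and the filter sizes are all assumptions chosen for the example.

```python
import hashlib


class BloomFilter:
    """Plain Bloom filter; one byte per bit to keep the code short."""

    def __init__(self, n_bits: int, n_hashes: int):
        self.n_bits = n_bits
        self.n_hashes = n_hashes
        self.bits = bytearray(n_bits)

    def _positions(self, item):
        for i in range(self.n_hashes):
            digest = hashlib.blake2b(f"{i}:{item}".encode(), digest_size=8).digest()
            yield int.from_bytes(digest, "big") % self.n_bits

    def add(self, item) -> None:
        for pos in self._positions(item):
            self.bits[pos] = 1

    def __contains__(self, item) -> bool:
        return all(self.bits[pos] for pos in self._positions(item))


class LearnedBloomFilter:
    """Oracle first, then a backup Bloom filter for the oracle's false negatives."""

    def __init__(self, keys, oracle, threshold, backup_bits, backup_hashes):
        self.oracle = oracle          # callable: item -> score in [0, 1]
        self.threshold = threshold
        self.backup = BloomFilter(backup_bits, backup_hashes)
        for key in keys:
            if oracle(key) < threshold:   # the oracle would wrongly say "no"
                self.backup.add(key)

    def __contains__(self, item) -> bool:
        if self.oracle(item) >= self.threshold:
            return True                   # may be an oracle false positive
        return item in self.backup        # may be a backup-filter false positive


def interval_oracle(x: int) -> float:
    """Stand-in for a learned model: confident 'yes' on the interval [100, 2000]."""
    return 1.0 if 100 <= x <= 2000 else 0.0


keys = list(range(100, 2001)) + [5000, 7321]   # mostly structured, two outliers
lbf = LearnedBloomFilter(keys, interval_oracle, threshold=0.5,
                         backup_bits=64, backup_hashes=4)
print(all(k in lbf for k in keys))  # every key is found: no false negatives
```

Only the two outlier keys land in the backup filter, which is why 64 bits suffice here, whereas a standard Bloom filter would need space proportional to all 1,903 keys.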
[00:45:59]
So, let me just try and get to as much of this as I can, but if you want to see more, I did some posts on this and have the papers out on arXiv. And the goal here was to look at what the actual types of guarantees are that this type of Bloom filter gives. They're not exactly the same as for the traditional Bloom filter, which is kind of important for networking people who are going to use these things.
[00:46:29]
Coming up with an analysis of the size of what you need for a learned Bloom filter to be successful, and some optimizations. And this is all going to be at NIPS in a couple of weeks, so yeah. Yeah. So I don't think we have any methods right now that deal with a shifting distribution per se.
[00:47:12]
The idea that they state in the paper is that you're going to have to monitor things as it goes, and if the false positives get too high, then you're going to want to retrain. Which, again, this is a flaw in their paper that I would say I'm skeptical about too, right? So there are definitely huge assumptions here about the stability of your data and how good your test set is in terms of predicting the false positive probability. So yeah, I agree there are huge assumptions, and I'd say it's definitely a piece of future work to try and figure out how much you can widen things to make this approach better, so that you don't need as many of those assumptions.
[00:48:02]
All right, so let me try to throw up like one formula-type thing. So again, we have m as the size of the set. If we think of X as being the size of the learned oracle, it will have some false positive rate associated with it, which you figure out empirically, say, and some fraction of the things will be false negatives.
[00:48:24]
So we're going to have a backup Bloom filter of size, say, b times m, for some constant b, for the backup Bloom filter. And I'm going to assume that the Bloom filter, you know, we've seen that the false positive rates for Bloom filters fall geometrically with the number of bits per item, so we'll just take that as part of our modeling assumption.
[00:48:49]
And so, you know, this part, I must admit, isn't rocket science. You just sort of plug in the values and say, OK, well, what are the false positive probabilities of all these things? OK. And the key is, I guess, that for the false positive probability of the learned Bloom filter, you get some fraction of false positives from here, and then anything that falls through, the one minus F.P. fraction that falls through here, is going to have false positives because it's a Bloom filter.
[00:49:23]
You get better than the b bits per item because only a fraction of the items from the original set are going to be false negatives, so you get to divide by the false negative rate. And so you can say, yeah, learned Bloom filters would be better than a regular Bloom filter as long as you can make your learned oracle sufficiently small, as governed by this equation, right? So that's the sort of thing that falls out.
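The plug-in calculation described above can be written out as a short sketch. The symbol names are assumptions for the example, since the slide's notation is not in the transcript: `fp` and `fn` are the oracle's empirical false positive and false negative rates, `b` is the backup filter's size in bits per original key, and `alpha` is the geometric base of a Bloom filter's false positive rate (about 0.6185 per bit per item for an optimally tuned standard filter). The oracle passes a `fp` fraction of non-members directly, and the remaining `1 - fp` fraction face a backup filter that holds only a `fn` fraction of the keys, so it effectively has `b / fn` bits per stored key.

```python
import math

ALPHA = 0.5 ** math.log(2)  # about 0.6185, optimally tuned standard Bloom filter


def learned_fpr(fp: float, fn: float, b: float, alpha: float = ALPHA) -> float:
    """Overall false positive rate of oracle + backup filter under this model."""
    return fp + (1 - fp) * alpha ** (b / fn)


def standard_fpr(bits_per_key: float, alpha: float = ALPHA) -> float:
    """False positive rate of a standard Bloom filter with the given budget."""
    return alpha ** bits_per_key


def max_oracle_bits_per_key(fp: float, fn: float, b: float,
                            alpha: float = ALPHA) -> float:
    """Largest oracle size (in bits per key) at which the learned filter still
    matches a standard Bloom filter given the same total space."""
    return math.log(learned_fpr(fp, fn, b, alpha), alpha) - b


budget = max_oracle_bits_per_key(fp=0.01, fn=0.5, b=8)
print(f"oracle must cost under {budget:.2f} bits per key to break even")
```

With these illustrative numbers the oracle has to fit in roughly one and a half bits per key, which makes concrete the point that the learned oracle must be sufficiently small for the scheme to pay off.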
[00:49:52]
One thing that people commented was interesting: one implication of this is that if you're dealing with sets, and you think of sets as being from a family of sets, and you think of the size of these sets as growing large, then if you can represent the set with a sublinear amount of information, right, eventually a learned Bloom filter will be better than a regular Bloom filter.
[00:50:21]
You know, this part, the part from the learned function, will eventually vanish compared to the linear size required for the Bloom filter. So then, the other thing that came out of this for me was that there are better ways of building these sorts of structures, which is that instead of just having a backup filter, you're actually better off removing bunches of false positives at the beginning. The problem is that there are too many false positives coming out here, so if you can stop a bunch of them at the beginning, it's actually a win. So you end up having what I call the sandwiched learned Bloom filter, because you have a Bloom filter at the top which gets rid of false positives, and a Bloom filter at the bottom which makes sure that you don't have false negatives. So they have 2 different purposes, but you're sandwiching your oracle between 2 Bloom filters, and that turns out to actually be better in terms of efficiency.
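A sandwiched structure of the kind described above might look like the following sketch. It is an illustration under assumptions, not the construction from the paper: the oracle is a hand-written interval test rather than a trained model, and the class names and all sizes are invented for the example. The initial filter holds every key, so it never rejects a true member, and it screens out most non-members before the oracle can turn them into false positives.

```python
import hashlib


class BloomFilter:
    """Plain Bloom filter; one byte per bit to keep the code short."""

    def __init__(self, n_bits: int, n_hashes: int):
        self.n_bits, self.n_hashes = n_bits, n_hashes
        self.bits = bytearray(n_bits)

    def _positions(self, item):
        for i in range(self.n_hashes):
            digest = hashlib.blake2b(f"{i}:{item}".encode(), digest_size=8).digest()
            yield int.from_bytes(digest, "big") % self.n_bits

    def add(self, item) -> None:
        for pos in self._positions(item):
            self.bits[pos] = 1

    def __contains__(self, item) -> bool:
        return all(self.bits[pos] for pos in self._positions(item))


class SandwichedLearnedBloomFilter:
    """Initial filter -> oracle -> backup filter."""

    def __init__(self, keys, oracle, threshold, initial_bits, backup_bits, n_hashes):
        self.oracle, self.threshold = oracle, threshold
        self.initial = BloomFilter(initial_bits, n_hashes)  # top slice: all keys
        self.backup = BloomFilter(backup_bits, n_hashes)    # bottom slice
        for key in keys:
            self.initial.add(key)
            if oracle(key) < threshold:
                self.backup.add(key)    # only the oracle's false negatives

    def __contains__(self, item) -> bool:
        if item not in self.initial:
            return False                # removes most would-be false positives
        if self.oracle(item) >= self.threshold:
            return True
        return item in self.backup      # protects against false negatives


def interval_oracle(x: int) -> float:
    """Stand-in oracle that overgeneralises to the whole interval [100, 2000]."""
    return 1.0 if 100 <= x <= 2000 else 0.0


keys = list(range(100, 2001, 2)) + [5000, 7321]   # even numbers plus two outliers
sandwich = SandwichedLearnedBloomFilter(keys, interval_oracle, threshold=0.5,
                                        initial_bits=8192, backup_bits=64, n_hashes=4)
odd_non_keys = range(101, 2000, 2)                # all are oracle false positives
survivors = sum(x in sandwich for x in odd_non_keys)
print(f"{survivors} of {len(odd_non_keys)} oracle false positives survive")
```

In this toy setup the oracle alone would accept every odd number in the interval, and the initial filter is what cuts that down. The backup filter stays small because it only has to cover the two keys the oracle misses.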
[00:51:34]
One of the outcomes of the analysis is that it turns out, in fact, that when you do this, the backup Bloom filter has a fixed size given the other parameters. So the idea is that you have to use the backup Bloom filter to prevent false negatives, but at some point, if you had extra bits, you never want to put them in the backup Bloom filter. You're fine, you've caught the false negatives; you want to use all your bits to get rid of the false positives up at the top, and that's more efficient. So you have the small, fixed-size backup Bloom filter, and any extra bits go up at the top.
[00:52:12]
So you get this; you know, for me this is the fun sort of analysis I like to do, so I had fun with it. All right, so now's a good time to end because it's like 11:57. So, you know, the things to hopefully take away: if you've never seen the cuckoo filter, for the cuckoo filter there is an implementation up on GitHub by, you know, the systems people I work with, who actually did the real work and wrote the paper, and there's like an open source implementation. So if you're a systems person in the audience wanting to use a Bloom filter, you may want to look and grab the cuckoo filter instead.
[00:52:54]
This idea of adaptivity I think is important, and there's, I think, already some additional work, or you know, sort of simultaneous with us: there's a group, Michael Bender and others, who were looking at this notion of adaptivity for these sorts of data structures.
[00:53:13]
You know, their work was more theoretical. I think our work shows that you can maintain the sort of simplicity that we like in these sorts of data structures while still allowing adaptivity. And then there's this new idea of learned Bloom filters and related types of data structures. I think to me this is an interesting or exciting question of, you know, are there places where we can use machine learning to improve on what I would call traditional or boring, you know, algorithms and data structures.
[00:53:52]
In sort of new and interesting ways: can we use predictions? You know, if we assume that we have these learning tools that will give us good predictions about something, how can we use those predictions to design, what I want to do is design, provably better algorithms, where the "provably" might be, you know, provable assuming that you have an oracle with these properties? But I'd like to be able to say provable things and not just say, yeah, we threw machine learning at it and look how wonderfully it worked.
[00:54:23]
And like, I think this is a direction for being able to obtain provable results that involve machine learning; that's the goal. With. Me in the game yeah. Well, so certainly for all of these data-structure-type things there are various things on trying to reduce the hashing requirements. Bloom filters work just fine with pairwise independence, they just need more space; or k-wise independence, even, they just need more space.
[00:55:16]
Mikkel Thorup and his group, they're very big on tabulation hashing, and they have a recent paper that says that, you know, tabulation hashing works fine for Bloom filters. It's not deterministic, right? With tabulation hashing you need the initial randomness to fill the table, right, but after that, then yeah. So if you like, it works just fine with it, but you need more space, right, their analysis changes, so you need a small constant factor more space to get the same sort of false positive probabilities. And it's sort of like what you're asking, right: because things aren't fully random, to get the probability that there is a one in a location you're stuck using the union bound instead of the fully random calculation, so it's weaker by, like, that factor. But if you look at the paper that Salil and I had way back when on
[00:56:23]
why simple hash functions are good enough, one of the data structures we looked at was Bloom filters, and one of the things that comes out of that is that k-wise independence is fine; there's just a corresponding space cost. Well, yeah, sort of. OK. So. I think I'd have to think about it or see the problem. My guess is yes, because hashing can do everything, but then I would have to think.
[00:57:30]
But sometimes it takes thinking about before I figure out what it contains. And I got one more thing that's. A good thing.