Flower Filter – A simple time-decaying approximate membership filter

In this post, we would like to talk about a simple data structure we designed: an approximate membership filter with probabilistic time decay. That may not sound simple, but the filter is easy to implement and highly concurrent, and it achieves very low memory overhead.

Our system receives a large number of incoming events. It needs to detect recurring events and apply special processing to them. An LRU cache may come to mind for this task, but we don’t use one. Why? An LRU cache is a good choice when accuracy matters, but accuracy is not our highest priority. Our data structure needs to remember newer data with a higher probability and older data with a lower probability. In addition, the system will hold a large number of filters, so memory overhead must be low. Lastly, filters are serialized and deserialized frequently, and both operations need to be fast. Therefore, we looked for a different solution than LRU, and we came up with the data structure described in this post.

Let’s start with the most naive approach. Suppose an event is an integer E, and we have an integer array of size s. For each event, we compute a hash value to locate a position in the array and write E at that position. An older event is more likely to be overwritten by newer events. The survival probability of an event is

$$\varphi = \left(1 - \frac{1}{s}\right)^{t}$$

Here t is the number of events inserted after this event. This is an exponential decay function.

[Figure: survival probability φ as a function of t for the single-hash scheme, showing exponential decay]
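As a concrete illustration, here is a minimal Python sketch of this naive scheme (the class name and the choice of crc32 as the hash are ours, for illustration only):

```python
import zlib

class NaiveFilter:
    """One slot per hash position; newer events overwrite older ones."""

    def __init__(self, s: int):
        self.slots = [0] * s  # s: array size

    def _position(self, event: int) -> int:
        # Any well-mixed integer hash works; crc32 is just an example.
        return zlib.crc32(event.to_bytes(8, "little")) % len(self.slots)

    def put(self, event: int) -> None:
        self.slots[self._position(event)] = event

    def contains(self, event: int) -> bool:
        return self.slots[self._position(event)] == event
```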

Although this works, hash collisions occur in a real environment. If a collision happens among frequent events, they keep overwriting each other. Therefore, we need to reduce the collision rate somehow.

How about using more hash functions? Instead of writing to a single place in the array, we write the value to multiple places using multiple hash functions. The probability that all of the hash values collide should be lower. Let n be the number of hash functions. An event survives as long as at least one of its n places is not overwritten. How does the survival probability change? For brevity, we will gloss over the fact that collisions may occur among the hash values of a single event.

Every event writes to n places in the array and survives if at least one of those n places is intact. The survival probability φ of an event after t subsequent insertions is

$$\varphi = 1 - \left(1 - p\right)^{n}$$

where

$$p = \left(1 - \frac{n}{s}\right)^{t}$$

is the probability that a single place remains intact after t insertions (each insertion overwrites n of the s slots).
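These formulas are easy to evaluate directly. A small Python sketch (the sample values of n, s, and t are arbitrary):

```python
def survival_probability(t: int, n: int, s: int) -> float:
    """phi = 1 - (1 - p)^n with p = (1 - n/s)^t, per the formulas above."""
    p = (1.0 - n / s) ** t
    return 1.0 - (1.0 - p) ** n

# With s = 1000, a larger n holds on to recent events longer
# but forgets older ones sooner.
for n in (1, 2, 8):
    print(n, [round(survival_probability(t, n, 1000), 3)
              for t in (100, 500, 2000)])
```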

The following graph shows a few examples for different values of n (s is fixed at 1000).

[Figure: survival probability φ vs. t for several values of n, with s fixed at 1000]

With a higher n, the probability declines slowly at the beginning, which means the filter remembers recent events very well. The curve approaches a step function as n increases. It is also clear, however, that the overall capacity for remembering events decreases as n gets bigger, since each event consumes n slots.

To compensate for the capacity decrease, let’s consider increasing s. In the following graph, s is increased proportionally to log2(n).

[Figure: survival probability φ vs. t when s is scaled in proportion to log2(n)]

The capacity still decreases somewhat, but scaling s on the order of log2(n) seems to be a reasonable guideline.
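Read as code, the sizing guideline might look like the following (the function name and base size are our own illustration):

```python
import math

def scaled_size(n: int, base_size: int = 1000) -> int:
    """Scale the array size like log2(n) to offset the capacity
    spent writing each event to n slots (base_size is illustrative)."""
    return base_size if n <= 1 else int(base_size * math.log2(n))

print([(n, scaled_size(n)) for n in (1, 2, 4, 8, 16)])
# -> [(1, 1000), (2, 1000), (4, 2000), (8, 3000), (16, 4000)]
```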

We would like to use a bigger n, but space efficiency becomes an issue. It would be great if we could decrease the required storage size without decreasing the capacity. Let’s try approximate matching instead of exact matching and examine how the approximation affects accuracy.

We will reduce the bit size of an array entry. Another hash function maps the 32-bit integer to a smaller number of bits. Let’s call this value a fingerprint, and let b be the size of a fingerprint in bits. Intuitively, the probability of a false positive f at a single place is

$$f = \frac{1}{2^{b}}$$

Considering this, a single place now matches either when it is intact (probability p) or when it has been overwritten but the new fingerprint happens to match (probability (1 − p)f). The probability φ, which is now the probability that the filter answers “yes” including false positives, can be written as

$$\varphi = 1 - \bigl((1 - p)(1 - f)\bigr)^{n}$$

where

$$p = \left(1 - \frac{n}{s}\right)^{t}, \qquad f = \frac{1}{2^{b}}$$
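Putting the pieces together, here is a hedged sketch of the whole filter in Python. The hash choices (seeded crc32), the nonzero-fingerprint trick, and all names are our assumptions, not necessarily the original implementation:

```python
import zlib

class FlowerFilter:
    """Sketch of a time-decaying approximate membership filter.

    s: array size, n: hash positions per event, b: fingerprint bits.
    """

    def __init__(self, s: int, n: int, b: int):
        self.slots = [0] * s
        self.n = n
        self.mask = (1 << b) - 1  # keep the low b bits as the fingerprint

    def _positions(self, data: bytes):
        # n positions from n seeded hashes; crc32 with different start
        # values stands in for n independent hash functions.
        return [zlib.crc32(data, seed) % len(self.slots)
                for seed in range(1, self.n + 1)]

    def _fingerprint(self, data: bytes) -> int:
        # A separate hash truncated to b bits, forced nonzero so an
        # empty (zero) slot can never produce a match.
        return (zlib.crc32(data, 0x5EED) & self.mask) or 1

    def put(self, event: int) -> None:
        data = event.to_bytes(8, "little")
        fp = self._fingerprint(data)
        for pos in self._positions(data):
            self.slots[pos] = fp

    def might_contain(self, event: int) -> bool:
        data = event.to_bytes(8, "little")
        fp = self._fingerprint(data)
        return any(self.slots[pos] == fp for pos in self._positions(data))
```

A lookup answers “yes” if at least one of the n places still holds the event’s fingerprint, matching the definition of φ above.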

The next graph shows the probabilities for a few values of n, with b fixed at 8.

[Figure: probability φ vs. t for several values of n with b fixed at 8; the curves flatten out instead of approaching zero]

Now the curves don’t approach zero. The effect of false positives is surprisingly large for a big n: every event checks n places, so a large n multiplies the chances of a spurious match. We cannot choose a small b when we use a large n. In my opinion, practical parameter choices are { b = 8, n < 4 } and { b = 16, 4 ≤ n ≤ 32 }.

We designed a very simple data structure for an approximate membership filter with probabilistic time decay. There are three parameters (n, b, and s) to configure according to the application’s needs. The filter is easy to implement and highly concurrent, since there are no pointers to update. Most importantly, it has very low memory overhead, since it consists of just one array object.
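Since the entire state is one flat array of fingerprints, serialization can be little more than a byte dump. A sketch under the assumption of b = 16 fingerprints (the format choice is ours):

```python
from array import array

def serialize(slots: list[int]) -> bytes:
    # The filter state is a single flat array of fingerprints,
    # so serializing is just a byte dump ("H" = unsigned 16-bit).
    return array("H", slots).tobytes()

def deserialize(data: bytes) -> list[int]:
    a = array("H")
    a.frombytes(data)
    return list(a)
```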

12 comments
eishay

hey @jaykreps, you make me feel bad. You're always welcome at @42eng, we're just around the corner from @LinkedIn. Drop by to say hi :-)

samklr

@jaykreps pure beauty.

javasoze

@jaykreps I miss him too! Big loss for @linkedin for losing a talent such as this! cc @CodeRemix

Tomas Pospichal

This is indeed a simple and useful variation on the theme of Bloom filters. The relevant Wikipedia page links to a paper describing another (only slightly more complicated) approach to the same problem:

http://en.wikipedia.org/wiki/Bloom_filter#CITEREFDengRafiei2006

where the information retained in the array is probabilistically related to the "age" of entries, while your approach retains part of the "identity" information.

The technique is in need of a short, catchy name (if it does not have one already), to have a chance of becoming better recognized.

One possible variation, useful in case the array may serve as a source of other statistical information about a finite generation of events (aside from helping to detect duplicates): by allowing n (the number of positions the current event overwrites) to change with advancing time - for instance, decreasing it very slowly and stochastically rounding to an integer - one can reduce the strong bias toward the most recent events. That would be useful if one wanted to estimate the frequencies of the most commonly observed events. Another small consideration in favor of such a modification comes from the problem of "uninitialized" array values when the number of events seen so far is not yet large compared to the size of the array. Starting with n equal to the size of the array for the very first event would make the strong initial bias of the array go away (at that early stage, the array likely indicates lots of "events" that hash to zero if n is much smaller than s).

pascallouis

@eishay @jaykreps @42eng and knowing Eishay, I'm sure you can stay there as well!

Andrew Conner

@Tomas Pospichal After much debate around the office, we decided to name it "Flower Filter", in homage to Bloom (unfortunately -- or perhaps fortunately -- Yasuhiro is far too humble to let us name this implementation after him). We will hopefully be publishing an implementation soon.

eishay

@pascallouis I was about to write that it’s not ethical to poach engineers from previous employers, but some would call me a hypocrite :-)

alejandrocrosa

. @pascallouis someone should linkedin skill endorse @eishay for "kidnapping engineers"