Flower Filter – An Update

A few days ago, Yasuhiro published an article on Flower Filter.

The main motivation of Flower Filter is that we need a data structure that remembers newer data with higher probabilities and older data with lower probabilities. We call such probability “survival probability”. In this post, we discuss some math subtlety of the survival probability.

First, let’s have a quick review of the background. We have an integer array A of size s, and A stores the fingerprints of events. When an event E comes, we use n hash functions to compute n location indexes, and E stores its fingerprints in these locations. (Due to hash collisions, may take fewer than n locations in array A. But let’s ignore this for now.) Suppose t events come after event E. These events may overwrite event E’s fingerprints. We say event E survives if at least one of its n fingerprints is intact.

In the original article, the survival probability Φ of event is computed as:

wrong

Here, p is the survival probability of one fingerprint of event E, and it is computed as:

eq3

The summands in the summation are binomial probabilities. Thus, an implicit assumption is: the survival of different fingerprints of one event are independent. If you are really picky about math, you will find this assumption is not valid.

A simple counter-example: consider n=2, s=4, t=1. In other words, array A can hold up to 4 fingerprints, and each event can store up to 2 fingerprints. Suppose initially an event E stores two fingerprints in slots 1 and 2. We need to compute the survival probability of E if there is one more event (since t=1).

If the survival of slot 1 and slot 2 were independent, then the conditional probability p(slot 2 intact given that slot 1 intact) should equal p(2 intact). But,

not-indepedent

Thus, intactness of 1 and 2 are not independent.

To get around this dependency problem, we can use the inclusion-exclusion principle. Let Xi denote the case that the i-th fingerprint of event E is intact after t events. We have

correct-survival-prob

Clearly, this looks very different from the original formula. The question is, in reality, does this matter much?  The answer is no, and we hope the plot below can convince the reader. Here, the green curve represents the new formula, the bule curve for the old formula.

flower-filter-compare

As we can see, the difference is negligible. In fact, in our system, n<<s, so the difference is even smaller.

The new formula assumes that event E stores exactly n fingerprints in array A. If you are picky, you can find the total probability by marginalizing over all cases (E stores exactly i fingerprints, i = 1, …, n). But we believe the difference is marginal.

Conclusion: In the original Flower Filter blog post, we implicitly assumed that the survival probabilities of an event’s different fingerprints are independent. In this article, we showed that that assumption does not hold, derived a new formula to compute the survival probability of an event, and demonstrated that the difference is negligible in practice.

6 comments
thandieudaihiep
thandieudaihiep

 @thandieudaihiep trying to understand formula of the survival probability Φ of event. Not sure to get it but at least  that formula can be reduced to simply 1-((1-p)**n) because sum{i=0..n} ( c(i,n) * p^i * (1-p)^(n-i) ) = 1. Please correct me if I am wrong

Florian H
Florian H

Interesting stuff...it's always good to fully understand the processes and mechanisms one is working with and relying on!

Y
Y

This is really interesting.I think that any reader should probably ask himself what is the context to this solution? e.g. what is the problem that you are trying to solve?What is the tradeoff?

T
T

Vey Informative! 

yingjieMiao
yingjieMiao

 @thandieudaihiep  @thandieudaihiep 

Thanks for the comment. You are right about the 'old' formula, it's a standard summation of Binomial probabilities. However, the new formula is quite different: we have an alternating sum, and each p is dependent of index i. 

yingjieMiao
yingjieMiao

For the context to this solution, please see Yasuhiro's original post.