Bloom Filter Essentials
What a Bloom Filter Does
Section titled “What a Bloom Filter Does”A Bloom filter is a very compact data structure for answering one question:
“Have I seen this item before?”
It is designed for fast membership tests with a tradeoff:
- It can say “definitely not present” or “probably present.”
- No false negatives: if it says “not present,” that is guaranteed correct.
- Possible false positives: it may say “present” even when the item was never added.
That tradeoff is what makes Bloom filters so space-efficient.
It does not support exact membership, and the basic form does not support safe deletes.
Core idea
Section titled “Core idea”A Bloom filter is:
- a bit array of size
m khash functions
Initially, all bits are 0.
To add an item
Section titled “To add an item”For an item like "cat":
- Run it through
khash functions. - Each hash gives a position in the bit array.
- Set those positions to
1.
To check an item
Section titled “To check an item”For an item like "cat":
- Run the same
khash functions. - Look at those
kpositions. - If any of them is
0, the item is definitely not in the set. - If all of them are
1, the item is probably in the set.
Small example
Section titled “Small example”Say the filter has 10 bits:
0000000000Use 3 hash functions.
Insert "cat" — suppose the hashes map to positions 2, 5, 8:
0010010010Insert "dog" — suppose hashes map to 1, 5, 9:
0110010011Check "bird" — suppose hashes map to 2, 4, 8.
Bits at 2 and 8 are 1, but bit 4 is 0, so "bird" is definitely not present.
Check "fox" — suppose hashes map to 1, 5, 8.
All are 1, so the filter says probably present — but maybe "fox" was never inserted. Those bits may have been set by other items. That is a false positive.
Why false positives happen
Section titled “Why false positives happen”Bloom filters do not store the actual items. They only store which bit positions were touched.
As more elements are added, more bits become 1. Eventually, different items start overlapping heavily. Then a new item may happen to hash only to positions that are already 1, even though it was never inserted. That is the entire reason false positives happen.
Approximate false-positive rate:
p ≈ (1 - e^(-k*n/m))^kWhere:
n= inserted itemsm= bit-array lengthk= number of hashes
Useful tuning formulas:
k_opt ≈ (m/n) * ln(2)m ≈ -(n * ln p) / (ln 2)^2Why there are no false negatives
Section titled “Why there are no false negatives”When you insert an item, its k bit positions are set to 1. A normal Bloom filter never clears bits.
Later, when you check that same item, those positions will still be 1. Therefore, once inserted, an item cannot be reported as absent. That is why a standard Bloom filter has no false negatives.
Design Trade-offs
Section titled “Design Trade-offs”- Larger
m-> lower false positives, higher memory use - Larger
k-> lower false positives up to a point, then extra CPU cost - Higher
nwith fixedm-> quickly rising false positives