Bloom Filter Essentials

What a Bloom Filter Does

A Bloom filter is a very compact data structure for answering one question:

“Have I seen this item before?”

It is designed for fast membership tests with a tradeoff:

It can say “definitely not present” or “probably present.”
No false negatives: if it says “not present,” that is guaranteed correct.
Possible false positives: it may say “present” even when the item was never added.

That tradeoff is what makes Bloom filters so space-efficient.

It does not support exact membership, and the basic form does not support safe deletes.

Bloom filter membership flow

Core idea

A Bloom filter is:

a bit array of size m
k hash functions

Initially, all bits are 0.

To add an item

For an item like "cat":

Run it through k hash functions.
Each hash gives a position in the bit array.
Set those positions to 1.

To check an item

For an item like "cat":

Run the same k hash functions.
Look at those k positions.
If any of them is 0, the item is definitely not in the set.
If all of them are 1, the item is probably in the set.

Small example

Say the filter has 10 bits:

0000000000

Use 3 hash functions.

Insert "cat" — suppose the hashes map to positions 2, 5, 8:

0010010010

Insert "dog" — suppose hashes map to 1, 5, 9:

0110010011

Check "bird" — suppose hashes map to 2, 4, 8.

Bits at 2 and 8 are 1, but bit 4 is 0, so "bird" is definitely not present.

Check "fox" — suppose hashes map to 1, 5, 8.

All are 1, so the filter says probably present — but maybe "fox" was never inserted. Those bits may have been set by other items. That is a false positive.

Why false positives happen

Bloom filters do not store the actual items. They only store which bit positions were touched.

As more elements are added, more bits become 1. Eventually, different items start overlapping heavily. Then a new item may happen to hash only to positions that are already 1, even though it was never inserted. That is the entire reason false positives happen.

Approximate false-positive rate:

p ≈ (1 - e^(-k*n/m))^k

Where:

n = inserted items
m = bit-array length
k = number of hashes

Useful tuning formulas:

k_opt ≈ (m/n) * ln(2)
m ≈ -(n * ln p) / (ln 2)^2

Bloom filter false-positive tuning

Why there are no false negatives

When you insert an item, its k bit positions are set to 1. A normal Bloom filter never clears bits.

Later, when you check that same item, those positions will still be 1. Therefore, once inserted, an item cannot be reported as absent. That is why a standard Bloom filter has no false negatives.

Design Trade-offs

Larger m -> lower false positives, higher memory use
Larger k -> lower false positives up to a point, then extra CPU cost
Higher n with fixed m -> quickly rising false positives