Skip to content
Dev Dump

Bloom Filter Essentials

A Bloom filter is a very compact data structure for answering one question:

“Have I seen this item before?”

It is designed for fast membership tests with a tradeoff:

  • It can say “definitely not present” or “probably present.”
  • No false negatives: if it says “not present,” that is guaranteed correct.
  • Possible false positives: it may say “present” even when the item was never added.

That tradeoff is what makes Bloom filters so space-efficient.

It does not support exact membership, and the basic form does not support safe deletes.

Bloom filter membership flow

A Bloom filter is:

  • a bit array of size m
  • k hash functions

Initially, all bits are 0.

For an item like "cat":

  1. Run it through k hash functions.
  2. Each hash gives a position in the bit array.
  3. Set those positions to 1.

For an item like "cat":

  1. Run the same k hash functions.
  2. Look at those k positions.
  3. If any of them is 0, the item is definitely not in the set.
  4. If all of them are 1, the item is probably in the set.

Say the filter has 10 bits:

0000000000

Use 3 hash functions.

Insert "cat" — suppose the hashes map to positions 2, 5, 8:

0010010010

Insert "dog" — suppose hashes map to 1, 5, 9:

0110010011

Check "bird" — suppose hashes map to 2, 4, 8.

Bits at 2 and 8 are 1, but bit 4 is 0, so "bird" is definitely not present.

Check "fox" — suppose hashes map to 1, 5, 8.

All are 1, so the filter says probably present — but maybe "fox" was never inserted. Those bits may have been set by other items. That is a false positive.

Bloom filters do not store the actual items. They only store which bit positions were touched.

As more elements are added, more bits become 1. Eventually, different items start overlapping heavily. Then a new item may happen to hash only to positions that are already 1, even though it was never inserted. That is the entire reason false positives happen.

Approximate false-positive rate:

p ≈ (1 - e^(-k*n/m))^k

Where:

  • n = inserted items
  • m = bit-array length
  • k = number of hashes

Useful tuning formulas:

k_opt ≈ (m/n) * ln(2)
m ≈ -(n * ln p) / (ln 2)^2

Bloom filter false-positive tuning

When you insert an item, its k bit positions are set to 1. A normal Bloom filter never clears bits.

Later, when you check that same item, those positions will still be 1. Therefore, once inserted, an item cannot be reported as absent. That is why a standard Bloom filter has no false negatives.

  • Larger m -> lower false positives, higher memory use
  • Larger k -> lower false positives up to a point, then extra CPU cost
  • Higher n with fixed m -> quickly rising false positives