Skip to content
Dev Dump

LSM Tree Database Architecture

In a database, we need to support two fundamental operations: read and write. Let’s build one from scratch, starting simple and evolving it to handle real-world scale.


The simplest approach: maintain an in-memory map.

Map<Integer, String> data = new TreeMap<>();
// Write
data.put(5, "Alice");
// Read
String value = data.get(5); // O(log n)

When a user writes data, we insert it into the map. When they read, we look it up and return it.

For a small dataset, this works perfectly. Reads and writes are both fast.

But there is a problem.


What happens if the system crashes?

If everything is stored only in memory, a crash wipes out all recent writes. That’s unacceptable for any real system.

The fix: Before writing to memory, append every incoming write to a file on disk. This file is sequential and append-only, which makes it very fast to write. Only after the write is safely recorded do we update the in-memory structure.

We call this file the Write-Ahead Log (WAL).

WAL on disk:
[put(5, "Alice"), put(2, "Bob"), put(8, "Carol"), delete(5)]

LSM Write Path

This changes the guarantees completely:

  • Even if the process crashes, the WAL contains the history of operations
  • On restart, we replay the WAL and rebuild the in-memory state
  • Each write does two things: persisted in WAL for durability, stored in memory for fast access

What’s next? The system is fast, simple, and durable. But when real scale kicks in, memory cannot hold everything.


Beyond a certain point, we need to flush data from memory to disk.

We want to keep data sorted in the disk file so lookups are O(log n) binary search, not O(n) linear scan. For this reason, our in-memory structure is an ordered map (like TreeMap or a skip list), making the flush operation cheap.

We call the in-memory ordered structure a Memtable. It holds the most recent writes and is optimized for fast access.

When the Memtable grows beyond a size threshold, we flush it to disk as a Sorted String Table (SSTable). The SSTable is immutable—once written, it is never modified.

Memtable (in memory, sorted)
↓ flush when full
SSTable (on disk, sorted, immutable)
  • No in-place updates = no fragmentation
  • Simpler crash recovery
  • Sequential write = cheap
  • Binary search possible within the file
  • Merge operations become efficient (we’ll need this later)

Suppose we receive: put(5), put(2), put(8), put(3) (values omitted for simplicity).

Each operation is first written to the WAL, then inserted into the Memtable. Since the Memtable is ordered, it ends up as:

Memtable: [2, 3, 5, 8]

Say the Memtable reaches its size threshold and flushes to disk:

SSTable-1: [2, 3, 5, 8]
Memtable: [] (empty, ready for new writes)

Now more writes arrive: put(1), put(7), put(3).

These go into the Memtable:

Memtable: [1, 3, 7]
SSTable-1: [2, 3, 5, 8]

The system is now split across two places. The Memtable contains the most recent data, and the SSTable contains older data.


When a user asks for a key, we always start with the Memtable—it contains the freshest data. If the key is found, return immediately. If not, check SSTables from newest to oldest.

Using our example state:

Memtable: [1, 3, 7]
SSTable-1: [2, 3, 5, 8]

Query: key = 3

  • Found in Memtable → return that value
  • SSTable-1 also has key 3, but it’s older and ignored

Query: key = 5

  • Not in Memtable
  • Found in SSTable-1 → return that value

The rule: Always prefer newer data. The first match wins.


Deletes introduce an interesting challenge. Since SSTables are immutable, we cannot go back and remove a key from disk.

Solution: Record the intent to delete. When a user deletes a key, we write a special marker called a tombstone. This tombstone is treated like any other write—it goes into the WAL and into the Memtable.

// Delete key 3
memtable.put(3, TOMBSTONE);

Consider this state:

Memtable: [3 → TOMBSTONE]
SSTable-1: [2, 3, 5, 8]

Query: key = 3

  • Memtable has a tombstone → key is deleted
  • We do NOT continue searching older SSTables

The data still physically exists on disk, but it’s logically invisible. We don’t clean up immediately—we record the latest truth and reconcile later.


As more data flows through, we keep flushing Memtables to disk. Since SSTables are immutable, we create new files each time.

After another flush:

Memtable: []
SSTable-2: [1, 3, 7]
SSTable-1: [2, 3, 5, 8]

Now imagine reading key 3. It appears in multiple files. We must check newest first and stop as soon as we find a definitive answer.

This works, but it’s starting to feel inefficient. More SSTables = more files to check.


Instead of letting SSTables accumulate indefinitely, we periodically merge them. While merging, we clean up redundant data:

  • If the same key appears multiple times, keep only the newest version
  • If a key has a tombstone, remove older values during merge

Compaction is not just about reducing files—it’s how the system corrects itself over time.

LSM Compaction

Because every SSTable is sorted, merging is efficient—similar to the merge step in merge sort. We walk through files in order and produce a new sorted output.

Example: Duplicate Keys

SSTable-1 (older): [2, 3, 5, 8]
SSTable-2 (newer): [1, 3, 7]

When key 3 appears in both, we keep the newer value:

Merged result: [1, 2, 3, 5, 7, 8]

Example: With Tombstone

SSTable-1 (older): [2, 3, 5, 8]
SSTable-2 (newer): [3 → TOMBSTONE, 7]

The tombstone takes precedence. Key 3 disappears from final output:

Merged result: [2, 5, 7, 8]

This is when deletion becomes permanent. Until compaction, old values still exist on disk. After compaction, they’re gone.


It might seem tempting to merge everything into a single SSTable. But that doesn’t scale—merging large files repeatedly is expensive.

Instead, we organize SSTables into levels and compact gradually.

L0 (Memtable): active memory layer
L1: newest SSTables, small
L2: older, medium
L3: oldest, large
...

LSM Levels

New SSTables are created at L1. As files accumulate, they’re merged and pushed into deeper levels. Each level contains larger, more compacted data than the one above.

The flow: Data starts small and recent, gradually becoming larger and more stable as it moves down.


A new inefficiency appears. Searching for a key that doesn’t exist (say, key 42) might require scanning SSTables at multiple levels before concluding it’s not present.

Most of these checks are wasted work. What if we could know whether a key might exist in an SSTable without reading it?

Enter Bloom filters.

A Bloom filter is a compact bit array with hash functions. It answers: “Could this key exist in this file?”

  • “No” → Definitely not present. Skip this file.
  • “Maybe” → Possibly present. Check the file.

Bloom Filter

Think of a Bloom filter as a bit array, initially all zeros:

[0, 0, 0, 0, 0, 0, 0, 0, 0, 0]

We have two hash functions, hashA and hashB.

Insert key 2:

hashA(2) → position 3
hashB(2) → position 7

Flip those bits:

[0, 0, 0, 1, 0, 0, 0, 1, 0, 0]

Insert key 8:

hashA(8) → position 2
hashB(8) → position 7

Updated array:

[0, 0, 1, 1, 0, 0, 0, 1, 0, 0]

Query: Does key 3 exist?

hashA(3) → position 1
hashB(3) → position 7

Check the array:

  • Position 1 → 0
  • Position 7 → 1

Since at least one position is 0, key 3 is definitely not present.

Each SSTable maintains its own Bloom filter. Before reading a file, we consult its filter first.


Even with Bloom filters, reading an entire SSTable for one key is wasteful.

Solution: Divide each SSTable into smaller blocks. Maintain a compact in-memory sparse index that points to these blocks.

The index stores a sparse set of keys (e.g., every 100th key) with pointers to their blocks on disk. Think of it as a table of contents, not the full book.

Sparse Index:
key 100 → Block 1
key 200 → Block 2
key 300 → Block 3

Lookup for key 175:

  1. Consult sparse index → closest match is key 100 (Block 1) or key 200 (Block 2)
  2. Jump directly to Block 2
  3. Search only within that block

Blocks are small and sorted, so in-memory search is fast. We don’t load the entire SSTable, yet lookups are predictable.


With all pieces in place, a read becomes a structured process:

Read(key):
1. Check Memtable (freshest data)
2. For each level (newest to oldest):
a. Consult Bloom filter
b. If "no" → skip this SSTable
c. If "maybe" → use sparse index to find block
d. Search block for key
3. Return first match (or tombstone = deleted)

LSM Read Path

L0 (Memtable): empty
L1: [10, 20, 30]
L2: [5, 20, 25, 35, 45, 55]
L3: [1, 2, 3, 10, 15, 20, 21, 22, 23, 24]

Query: key = 20

  • L1 Bloom filter: maybe present
  • Find key 20 in L1 → return immediately

Query: key = 25

  • L1 Bloom filter: not present → skip
  • L2 Bloom filter: maybe present
  • Find key 25 in L2 → return

Query: key = 99

  • All Bloom filters say “not present”
  • Return “not found” without touching disk blocks

Query: key = 20 (with tombstone)

If a newer SSTable contains 20 → TOMBSTONE while an older one has the value:

  • Encounter tombstone first → key is deleted
  • Older value is never considered

What we’ve built is not theoretical. This is the soul of:

System TypeExamples
Storage EnginesRocksDB, LevelDB
Distributed DatabasesApache Cassandra, Apache HBase
Streaming/LoggingApache Kafka, Apache Pulsar
AnalyticsClickHouse, Bigtable

The details vary and optimizations are layered on top, but the core idea remains the same.

ConceptRocksDB/LevelDBCassandraHBase
Durability logWALCommit LogWAL
In-memory bufferMemtableMemtableMemStore
Disk fileSSTableSSTableHFile

An LSM Tree is a write-optimized storage architecture where:

  1. Writes go to a WAL for durability
  2. Then to a Memtable in memory (sorted)
  3. Memtable flushes to immutable, sorted SSTables on disk
  4. Multiple SSTables are merged through compaction
  5. Reads are optimized using Bloom filters and sparse indexes

One-line summary: LSM Trees convert random writes into sequential writes, then use compaction and indexing to make reads efficient later.

AspectAdvantageTrade-off
WritesSequential, very fastCompaction overhead
ReadsBloom filters skip filesMay check multiple levels
SpaceCompaction reclaimsTemporary duplication
DeletesImmediate logical deletePhysical cleanup delayed

Final Intuition

  • WAL = safety logbook
  • Memtable = active working desk
  • SSTables = organized filing cabinets
  • Compaction = background cleanup and reorganization

New data arrives quickly, gets stored safely, and is organized later without slowing down the write path.