Scaling an LSM-Tree: Thread-Safety and the Art of Non-Blocking

Moving from a single-threaded datastore to a high-throughput, concurrent LSM-tree isn't just about adding locks—it's about protecting invariants. Learn how to implement non-blocking reads and memtable rotation to build a database that is both thread-safe and fast.

In its original form, my LSM-based datastore was single-threaded. Every operation was deterministic, easy to reason about, and correct. It was also slow. When I tried to scale it by allowing multiple client threads, correctness evaporated almost instantly. State corruption, subtle races, and occasional crashes made it painfully clear that concurrency isn’t something you sprinkle on top of a design after the fact.

This post is about the approach I ended up taking to make the system thread-safe and fast enough to be practically usable under load. The journey involved some uncomfortable trade-offs and a noticeable increase in complexity — but also a massive improvement in throughput and stability.

Be prepared, this is a long one.

The Big Lock

The first idea that comes to mind for a lot of engineers when thinking about thread safety is the blunt but effective synchronized keyword: one global lock.

public synchronized void put(K key, V value) {
    // write to commit log
    // write to memtable
}

public synchronized V get(K key) {
    // read from memtable
    // read from SSTables
}

This guarantees mutual exclusion. Only one thread can be inside the database at any time. No races, no corruption, no surprises, and that’s exactly the problem.

With this approach:

  • Every read blocks every write
  • Every write blocks every read
  • Disk I/O blocks everything

If a flush takes 50 ms, all other clients wait 50 ms. The database becomes a queue. It’s correct, but it’s barely multi-threaded in practice.

This approach will not scale, so let's start thinking about a more fine-grained solution.

Defining Our North Stars: Goals and Invariants

Before looking at the code, we need to define exactly what we are trying to achieve. To build something better, I set three primary goals:

  1. Non-blocking Reads: A get() request should never be delayed because a write or a background flush is happening.
  2. Minimal Write Contention: Writers should only coordinate when absolutely necessary (like when the Memtable is full). Similar to reads, writes should mostly be non-blocking.
  3. Asynchronous Durability: Moving data from memory to a permanent SSTable (flushing) should happen in the background, away from the user's request path.

What is an "Invariant"?

In programming, an invariant is a property of a system that must always be true — no matter what else is happening.

Think of a bank account:

Invariant: balance = deposits − withdrawals

Even if ten people deposit money at the same time, the system must never reach a state — even briefly — where that equation doesn’t hold.

In concurrent systems, locks, atomics, and memory barriers exist to protect invariants.
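
To make this concrete, here is a minimal, hypothetical Java sketch (not from the actual codebase) showing how the invariant breaks without a lock, and how synchronized protects it:

public class Account {
    private long balance; // invariant: balance == deposits - withdrawals

    // Broken under concurrency: "balance += amount" is a read-modify-write,
    // so two threads can read the same old balance and one update is lost.
    public void depositUnsafe(long amount) {
        balance += amount;
    }

    // Safe: the lock makes the read-modify-write atomic, so other threads
    // can never observe a state where the invariant does not hold.
    public synchronized void deposit(long amount) {
        balance += amount;
    }
}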

The Invariants We Must Protect

To achieve our goals without returning wrong data or losing records, our LSM-tree must respect at least these two specific invariants:

  1. The Read Visibility Invariant: At any given microsecond, a key must be findable in at least one of three places: the active Memtable, the immutable queue, or an SSTable on disk. There can be no "dark period" during a rotation where a key exists in memory but the get() method can't see it.
  2. The Atomic Swap Invariant: A Memtable and its corresponding Commit Log (WAL) are a package deal. When we rotate to a new Memtable, we must ensure that the new writes are paired with a fresh Log segment simultaneously. We cannot have data in Memtable A pointing to Log Segment B; otherwise, a crash would leave our data unrecoverable (see the sketch after this list).
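
One way to make the second invariant hard to violate is to bundle each frozen Memtable with the id of its commit log segment, so the pairing can never drift apart. A minimal sketch, assuming a simplified ImmutableMemtable (the real class in the codebase may look different):

import java.util.Collections;
import java.util.NavigableMap;

public final class ImmutableMemtable<K, V> {
    private final NavigableMap<K, V> entries;  // frozen, read-only data
    private final long commitLogSegmentId;     // the WAL segment holding exactly these writes

    public ImmutableMemtable(NavigableMap<K, V> entries, long commitLogSegmentId) {
        this.entries = Collections.unmodifiableNavigableMap(entries);
        this.commitLogSegmentId = commitLogSegmentId;
    }

    public V get(K key) {
        return entries.get(key);
    }

    public long getCommitLogSegmentId() {
        return commitLogSegmentId;
    }
}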

From Blocking Flushes to Continuous Writes

In the original, single-threaded version of this datastore, life was simple. There was exactly one Memtable. When it filled up, the system did the only safe thing it could do:

  1. Stop accepting writes
  2. Flush the Memtable to disk as an SSTable
  3. Clear it
  4. Resume writes

This design had an obvious advantage: correctness was trivial. No other thread could observe the Memtable while it was being flushed, because no other thread existed.

The downside was equally obvious. Flushing was a full stop-the-world operation.

Why Non-Blocking Flushes Change Everything

The moment you decide that flushing must happen asynchronously to unblock writes, the single-Memtable design breaks down. You can’t flush a Memtable while it’s still being written to:

  • its contents are changing
  • iterators become invalid
  • you risk partial or duplicated data on disk

So the moment flushing moves to the background, a new requirement appears:

There must always be a Memtable available for incoming writes.

This is where Memtable rotation becomes unavoidable.

Memtable Rotation: The Enabler, Not the Optimization

Instead of one Memtable, the system now maintains two conceptual roles:

  • One active Memtable — always writable
  • Zero or more immutable Memtables — frozen snapshots waiting to be flushed
private final AtomicReference<Memtable<K, V>> activeMemtable;
private final Deque<ImmutableMemtable<K, V>> immutableMemtables;
💡
The active Memtable is held in an AtomicReference so that the reference change during rotation is immediately visible to all threads.

When the active Memtable reaches its size limit:

  1. It is frozen — no more writes allowed
  2. A brand-new Memtable is created
  3. Writes immediately continue on the new Memtable
  4. The frozen Memtable is flushed in the background

This single change:

  • Removes flush latency from the write path - a background thread makes sure the frozen Memtable is persisted to disk as an SSTable, so there is no need to block writes
  • Allows continuous ingestion - swapping in a fresh active Memtable is very fast, so writes can continue almost immediately
  • Keeps reads fully consistent

But it also introduces concurrency for the first time.

Why Rotation Is the Hard Part

In the single-threaded design, there was never more than one Memtable, and it was either writable or flushing — never both.

In the concurrent design:

  • one Memtable is being written to
  • several others may be flushing
  • readers must see all of them
  • writers must never write to the wrong one

Rotation is the moment where all of these concerns collide. That’s why most of the synchronization logic exists solely to make this transition safe.

The "Write-Side Only" Lock Pattern

In our LSM-tree we use a ReadWriteLock to manage the lifecycle of the Memtable. The "Readers" in our ReadWriteLock context are actually the threads performing put() operations.

A ReadWriteLock is a Java synchronization mechanism with the following rules:

  • Read Access: Multiple threads can hold the read lock at once, provided no thread holds the write lock.
  • Write Access: Only one thread can hold the write lock, and no other threads (readers or writers) can be active during that time.

Here’s how the roles are redefined in our use case:

  • ReadLock (The "Ingestion" Lock): Held by any thread currently writing data into the active Memtable. Multiple threads can write simultaneously, since the Memtable implementation now uses the thread-safe ConcurrentSkipListMap as its underlying data structure (see the sketch after this list).
  • WriteLock (The "Rotation" Lock): Held by the single thread that has decided the Memtable is full and needs to be swapped out for a fresh one. It is important that only one thread can access the active Memtable during the rotation.
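
For reference, the relevant fields might be declared roughly like this (a sketch; the names follow the snippets in this post, not necessarily the actual codebase):

import java.util.concurrent.ConcurrentSkipListMap;
import java.util.concurrent.atomic.AtomicReference;
import java.util.concurrent.locks.ReentrantReadWriteLock;

// In the store: many writers share the read lock, rotation takes the write lock.
private final ReentrantReadWriteLock rotationLock = new ReentrantReadWriteLock();
private final AtomicReference<Memtable<K, V>> activeMemtable = new AtomicReference<>();

// Inside Memtable: a thread-safe sorted map, so concurrent put() calls
// holding the shared read lock cannot corrupt the structure.
private final ConcurrentSkipListMap<K, V> data = new ConcurrentSkipListMap<>();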

Coordination Without Blocking Reads

When a put(key, value) call comes in, it acquires the readLock. This signals: "I am currently using the active Memtable; please do not swap it out from under me."

rotationLock.readLock().lock();
try {
    current = activeMemtable.get();
    current.put(record); // The "Read" lock is protecting this write operation!
} finally {
    rotationLock.readLock().unlock();
}

When the Memtable exceeds FLUSH_TO_DISK_LIMIT, we trigger maybeRotateMemtable. This is where the writeLock comes into play. By acquiring the writeLock, the rotation thread ensures that all "reader" threads (the active writers) have finished their current put() operations.

if (current.getSize() > FLUSH_TO_DISK_LIMIT) {
    maybeRotateMemtable(current);
}

...

private void maybeRotateMemtable(Memtable<K, V> current) {
    rotationLock.writeLock().lock();
    try {
        // Double-check under the write lock: another thread may have
        // already rotated this Memtable while we were waiting for the lock,
        // so we verify it is still the active one before rotating.
        if (activeMemtable.get() == current && current.getSize() > FLUSH_TO_DISK_LIMIT) {
            rotateMemtable(current);
        }
    } finally {
        rotationLock.writeLock().unlock();
    }
}

Double-checked locking: acquiring the write lock and re-verifying that this Memtable still needs rotation before swapping it out

The Atomic Swap and the Immutable Queue

Once the writeLock is secured, we perform the "magic" of the LSM-tree:

  1. Freeze: The full Memtable is marked immutable.
  2. Swap: A new, empty activeMemtable (and a fresh Commit Log segment) is created and set via an AtomicReference.
  3. Handoff: The old Memtable is pushed to the head of a ConcurrentLinkedDeque called immutableMemtables.
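
Put together, the rotation might look like the sketch below. Helper names such as freeze(), commitLog.startNewSegment(), and the Memtable constructor are assumptions for illustration, not necessarily the actual API:

// Called with the write lock held, so no put() can be in flight.
private void rotateMemtable(Memtable<K, V> full) {
    // 1. Freeze: snapshot the full Memtable together with its WAL segment id.
    ImmutableMemtable<K, V> immutable = full.freeze();

    // 2. Swap: a fresh Memtable paired with a fresh commit log segment,
    //    published atomically so every thread sees the new reference.
    long newSegmentId = commitLog.startNewSegment();
    activeMemtable.set(new Memtable<>(newSegmentId));

    // 3. Handoff: addFirst keeps the queue ordered newest-to-oldest.
    immutableMemtables.addFirst(immutable);

    // 4. Flush in the background; the write path never waits on disk.
    flushExecutor.submit(() -> flushImmutableMemtable(immutable));
}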

Maintaining the Read Visibility Invariant

Since the get() path is lock-free, we must be extremely careful about the order in which we search for data. If we aren't careful, a record could "disappear" for a few microseconds during the swap.

To prevent this, the get() method follows a strict Newest-to-Oldest hierarchy:

  1. The Active Memtable: We check the AtomicReference first.
  2. The Immutable Queue: We iterate through immutableMemtables. Because we use addFirst(), the most recently retired memtables are checked first.
  3. SSTables: Finally, we hit the disk via the ssTableManager.

By the time a Memtable is removed from the immutableMemtables deque, its data is guaranteed to be persisted in an SSTable, maintaining a seamless "hand-off" of data visibility.
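
In code, the lock-free read path might look like this sketch (tombstone handling omitted for brevity; ssTableManager.get() is assumed for illustration):

public V get(K key) {
    // 1. Active Memtable: a single volatile read of the AtomicReference.
    V value = activeMemtable.get().get(key);
    if (value != null) {
        return value;
    }

    // 2. Immutable queue: addFirst() during rotation means iteration
    //    here visits the most recently retired Memtables first.
    for (ImmutableMemtable<K, V> imm : immutableMemtables) {
        value = imm.get(key);
        if (value != null) {
            return value;
        }
    }

    // 3. SSTables on disk, newest first.
    return ssTableManager.get(key);
}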

Asynchronous Flushing

Flushing is the process of serializing a Memtable to disk to become an SSTable. With the memtable rotation described above in place, the now-immutable Memtables gather in a queue, and we need a background thread to poll this queue and do the actual flushing.

We do this with a single application-level background thread.

private final ExecutorService flushExecutor;

...

this.flushExecutor = Executors.newSingleThreadExecutor(r -> {
    Thread t = new Thread(r, "lsm-flush");
    t.setDaemon(true);
    return t;
});

A flush job is submitted to this executor as the last step of the memtable rotation, so normally the newly created immutable Memtable is flushed immediately and won't stay in the queue for long.

private void rotateMemtable(Memtable<K, V> memtable) {
   ...
   flushExecutor.submit(() -> flushImmutableMemtable(immutable));
}

private void flushImmutableMemtable(ImmutableMemtable<K, V> imm) {
    try {
        log.debug("Flushing immutable memtable with segment: {}", imm.getCommitLogSegmentId());

        // Persist first: the SSTable must exist on disk before the memtable
        // leaves the read path (the Read Visibility Invariant).
        ssTableManager.flush(imm);

        // The data is now durable, so the WAL segment is obsolete and the
        // memtable can safely leave the immutable queue.
        imm.deleteCommitLogSegment();
        immutableMemtables.remove(imm);

        log.debug("Successfully flushed and removed immutable memtable");
    } catch (IOException e) {
        // On failure the memtable stays in the queue, so its data remains
        // visible to reads.
        log.error("Flush failed for segment: {}", imm.getCommitLogSegmentId(), e);
    }
}

Summary

Making something thread-safe is rarely trivial and adds a lot of complexity, but it is necessary for any real-world application.

The first step was to make reads and writes non-blocking and move all operations involving disk I/O to background threads. In this article I described some core concepts and showed how the Memtable rotation and the asynchronous Memtable -> SSTable flush were implemented.

I will have a follow-up article about how the commit log is persisted asynchronously and what further performance tuning has been done.

The updated codebase is available on GitHub.