Scaling an LSM-Tree: Thread-Safety and the Art of Non-Blocking
Moving from a single-threaded datastore to a high-throughput, concurrent LSM-tree isn't just about adding locks—it's about protecting invariants. Learn how to implement non-blocking reads and memtable rotation to build a database that is both thread-safe and fast.
In its original form, my LSM-based datastore was single-threaded. Every operation was deterministic, easy to reason about, and correct. It was also slow. When I tried to scale it by allowing multiple client threads, correctness evaporated almost instantly. State corruption, subtle races, and occasional crashes made it painfully clear that concurrency isn’t something you sprinkle on top of a design after the fact.
This post is about the approach I ended up taking to make the system thread-safe, fast, and practically usable under load. The journey involved some uncomfortable trade-offs and a noticeable increase in complexity — but also a massive improvement in throughput and stability.
Be prepared, this is a long one.
The Big Lock
The first idea that comes to mind for a lot of engineers when thinking about thread safety is the blunt, but effective synchronized keyword: one global lock.
public synchronized void put(K key, V value) {
    // write to commit log
    // write to memtable
}

public synchronized V get(K key) {
    // read from memtable
    // read from SSTables
}
This guarantees mutual exclusion. Only one thread can be inside the database at any time. No races, no corruption, no surprises, and that’s exactly the problem.
With this approach:
- Every read blocks every write
- Every write blocks every read
- Disk IO blocks everything
If a flush takes 50 ms, all other clients wait 50 ms. The database becomes a queue. It’s correct, but it’s barely multi-threaded in practice.

This approach will not scale, so let's start thinking about a more fine-grained solution.
Defining Our North Stars: Goals and Invariants
Before looking at the code, we need to define exactly what we are trying to achieve. To build something better, I set three primary goals:
- Non-blocking Reads: A get() request should never be delayed because a write or a background flush is happening.
- Minimal Write Contention: Writers should only coordinate when absolutely necessary (like when the Memtable is full). Similar to reads, writes should mostly be non-blocking.
- Asynchronous Durability: Moving data from memory to a permanent SSTable (flushing) should happen in the background, away from the user's request path.
What is an "Invariant"?
In programming, an invariant is a property of a system that must always be true — no matter what else is happening.
Think of a bank account:
Invariant: balance = deposits − withdrawals
Even if ten people deposit money at the same time, the system must never reach a state — even briefly — where that equation doesn’t hold.
In concurrent systems, locks, atomics, and memory barriers exist to protect invariants.
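To make this concrete, here is a minimal sketch (generic Java, not the datastore's code) of protecting such an invariant with an atomic update: because each deposit or withdrawal is a single atomic operation, the balance always equals the net of all operations applied so far, and concurrent updates can never be lost or torn.

```java
import java.util.concurrent.atomic.AtomicLong;

// Minimal sketch: an account whose balance invariant is protected by
// atomic updates. Each operation is indivisible, so concurrent callers
// can never observe (or produce) a partially applied state.
class AccountInvariant {
    private final AtomicLong balance = new AtomicLong(0);

    void deposit(long amount)  { balance.addAndGet(amount); }
    void withdraw(long amount) { balance.addAndGet(-amount); }
    long balance()             { return balance.get(); }

    public static void main(String[] args) {
        AccountInvariant acct = new AccountInvariant();
        acct.deposit(100);
        acct.withdraw(40);
        System.out.println(acct.balance()); // 60: deposits minus withdrawals
    }
}
```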
The Invariants We Must Protect
To achieve our goals without returning wrong data or losing records, our LSM-tree must respect at least these two specific invariants:
1. The Read Visibility Invariant
At any given microsecond, a key must be findable in at least one of three places: the active Memtable, the immutable queue, or an SSTable on disk. There can be no "dark period" during a rotation where a key exists in memory but the get() method can't see it.
2. The Atomic Swap Invariant
A Memtable and its corresponding Commit Log (WAL) are a package deal. When we rotate to a new Memtable, we must ensure that the new writes are paired with a fresh Log segment simultaneously. We cannot have data in Memtable A pointing to Log Segment B; otherwise, a crash would leave our data unrecoverable.
From Blocking Flushes to Continuous Writes
In the original, single-threaded version of this datastore, life was simple. There was exactly one Memtable. When it filled up, the system did the only safe thing it could do:
- Stop accepting writes
- Flush the Memtable to disk as an SSTable
- Clear it
- Resume writes
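As a toy model of that flow (illustrative names and types, not the actual classes), the key point is that the same thread that fills the memtable also performs the flush, and nothing else can proceed meanwhile:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeMap;

// Toy model of the original single-threaded design: when the memtable
// reaches its limit, the calling thread flushes it in-line ("stop the
// world") before the put returns.
class BlockingLsm {
    static final int FLUSH_LIMIT = 3;
    final TreeMap<String, String> memtable = new TreeMap<>();
    final List<TreeMap<String, String>> ssTables = new ArrayList<>();

    void put(String key, String value) {
        memtable.put(key, value);
        if (memtable.size() >= FLUSH_LIMIT) {
            // The caller pays the full flush latency here.
            ssTables.add(new TreeMap<>(memtable)); // "write an SSTable"
            memtable.clear();                      // then resume writes
        }
    }

    public static void main(String[] args) {
        BlockingLsm db = new BlockingLsm();
        for (int i = 0; i < 7; i++) db.put("k" + i, "v" + i);
        System.out.println(db.ssTables.size() + " sstables, "
                + db.memtable.size() + " entries in memtable");
    }
}
```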
This design had an obvious advantage: correctness was trivial. No other thread could observe the Memtable while it was being flushed, because no other thread existed.
The downside was equally obvious. Flushing was a full stop-the-world operation.
Why Non-Blocking Flushes Change Everything
The moment you decide that flushing must happen asynchronously to unblock writes, the single-Memtable design breaks down. You can’t flush a Memtable while it’s still being written to:
- its contents are changing
- iterators become invalid
- you risk partial or duplicated data on disk
So the moment flushing moves to the background, a new requirement appears:
There must always be a Memtable available for incoming writes.
This is where Memtable rotation becomes unavoidable.
Memtable Rotation: The Enabler, Not the Optimization
Instead of one Memtable, the system now maintains two conceptual roles:
- One active Memtable — always writable
- Zero or more immutable Memtables — frozen snapshots waiting to be flushed
private final AtomicReference<Memtable<K, V>> activeMemtable;
private final Deque<ImmutableMemtable<K, V>> immutableMemtables;
When the active Memtable reaches its size limit:
- It is frozen — no more writes allowed
- A brand-new Memtable is created
- Writes immediately continue on the new Memtable
- The frozen Memtable is flushed in the background
This single change:
- Removes flush latency from the write path - a background thread makes sure the frozen memtable is persisted to disk as an SSTable, with no need to block writes
- Allows continuous ingestion - swapping in a new active memtable is very fast, so writes can quickly continue
- Keeps reads fully consistent
But it also introduces concurrency for the first time.
Why Rotation Is the Hard Part
In the single-threaded design, there was never more than one Memtable, and it was either writable or flushing — never both.
In the concurrent design:
- one Memtable is being written to
- several others may be flushing
- readers must see all of them
- writers must never write to the wrong one
Rotation is the moment where all of these concerns collide. That’s why most of the synchronization logic exists solely to make this transition safe.
The "Write-Side Only" Lock Pattern
In our LSM-tree we use a ReadWriteLock to manage the lifecycle of the Memtable. The "readers" in our ReadWriteLock context are actually the threads performing put() operations.
A ReadWriteLock is a Java synchronization mechanism with the following rules:
- Read Access: Multiple threads can hold the read lock at once, provided no thread holds the write lock.
- Write Access: Only one thread can hold the write lock, and no other threads (readers or writers) can be active during that time.
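These two rules can be demonstrated in a few lines with java.util.concurrent.locks.ReentrantReadWriteLock (a standalone example, not LSM code):

```java
import java.util.concurrent.locks.ReentrantReadWriteLock;

// Standalone demonstration of ReadWriteLock semantics: read locks are
// shared, while the write lock is exclusive even against readers.
class RwLockDemo {
    static boolean[] demo() {
        ReentrantReadWriteLock lock = new ReentrantReadWriteLock();
        lock.readLock().lock();
        boolean secondReadOk = lock.readLock().tryLock(); // shared: succeeds
        boolean writeOk = lock.writeLock().tryLock();     // exclusive: fails while any reader is active
        lock.readLock().unlock();
        lock.readLock().unlock();
        return new boolean[]{secondReadOk, writeOk};
    }

    public static void main(String[] args) {
        boolean[] r = demo();
        System.out.println("second reader acquired: " + r[0]
                + ", writer while reading: " + r[1]);
    }
}
```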
Here’s how the roles are redefined in our use case:
- ReadLock (The "Ingestion" Lock): Held by any thread currently writing data into the active Memtable. Multiple threads can write simultaneously since the Memtable implementation now uses the thread-safe ConcurrentSkipListMap as its data structure.
- WriteLock (The "Rotation" Lock): Held by a single thread that has decided the Memtable is full and needs to be swapped out for a fresh one. It is important that only one thread can access the active Memtable for the rotation.
Coordination Without Blocking Reads
When a put(key, value) call comes in, it acquires the readLock. This signals: "I am currently using the active Memtable; please do not swap it out from under me."
rotationLock.readLock().lock();
try {
    current = activeMemtable.get();
    current.put(record); // The "Read" lock is protecting this write operation!
} finally {
    rotationLock.readLock().unlock();
}
When the Memtable exceeds FLUSH_TO_DISK_LIMIT, we trigger maybeRotateMemtable. This is where the writeLock comes into play. By acquiring the writeLock, the rotation thread ensures that all "reader" threads (the active writers) have finished their current put() operations.
if (current.getSize() > FLUSH_TO_DISK_LIMIT) {
    maybeRotateMemtable(current);
}
...
private void maybeRotateMemtable(Memtable<K, V> activeMemtable) {
    rotationLock.writeLock().lock();
    try {
        if (activeMemtable.getSize() > FLUSH_TO_DISK_LIMIT) {
            rotateMemtable(activeMemtable);
        }
    } finally {
        rotationLock.writeLock().unlock();
    }
}
Double-checked locking: re-checking the size after acquiring the write lock for the active Memtable rotation
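To see why the size must be re-checked under the write lock, consider this toy model (hypothetical names, simplified state): two threads both observe a full memtable before either acquires the lock, but only the first to enter actually rotates; the second thread's re-check fails and it does nothing.

```java
import java.util.concurrent.atomic.AtomicInteger;
import java.util.concurrent.locks.ReentrantReadWriteLock;

// Toy model of double-checked locking for rotation: without the second
// size check, both threads would rotate and one memtable's data could
// be lost or flushed twice.
class DoubleCheckDemo {
    static final ReentrantReadWriteLock rotationLock = new ReentrantReadWriteLock();
    static final int LIMIT = 4;
    static final AtomicInteger size = new AtomicInteger(5); // already "full"
    static final AtomicInteger rotations = new AtomicInteger();

    static void maybeRotate() {
        rotationLock.writeLock().lock();
        try {
            if (size.get() > LIMIT) {   // re-check: another thread may have rotated already
                size.set(0);            // "rotate": swap in an empty memtable
                rotations.incrementAndGet();
            }
        } finally {
            rotationLock.writeLock().unlock();
        }
    }

    public static void main(String[] args) throws InterruptedException {
        // Both threads saw size > LIMIT before either took the lock.
        Thread a = new Thread(DoubleCheckDemo::maybeRotate);
        Thread b = new Thread(DoubleCheckDemo::maybeRotate);
        a.start(); b.start();
        a.join(); b.join();
        System.out.println(rotations.get()); // 1: the loser's re-check failed
    }
}
```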
The Atomic Swap and the Immutable Queue
Once the writeLock is secured, we perform the "magic" of the LSM-tree:
- Freeze: The full Memtable is marked immutable.
- Swap: A new, empty activeMemtable (and a fresh Commit Log segment) is created and set via an AtomicReference.
- Handoff: The old Memtable is pushed to the head of a ConcurrentLinkedDeque called immutableMemtables.
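These steps can be sketched with simplified toy types (plain maps instead of the real Memtable classes, and no commit log or flush executor). One detail worth noting in this lock-free-reader model: the sketch hands the frozen memtable off to the queue before publishing the new active one, so a concurrent reader can at worst see a record twice — never zero times.

```java
import java.util.concurrent.ConcurrentLinkedDeque;
import java.util.concurrent.ConcurrentSkipListMap;
import java.util.concurrent.atomic.AtomicReference;

// Simplified sketch of freeze / handoff / swap. Toy types only: the
// real system also pairs the new memtable with a fresh commit log
// segment and submits a background flush job.
class RotationSketch {
    static final AtomicReference<ConcurrentSkipListMap<String, String>> active =
            new AtomicReference<>(new ConcurrentSkipListMap<>());
    static final ConcurrentLinkedDeque<ConcurrentSkipListMap<String, String>> immutables =
            new ConcurrentLinkedDeque<>();

    static void rotate() {
        ConcurrentSkipListMap<String, String> frozen = active.get(); // freeze: no new writers (write lock held)
        immutables.addFirst(frozen);               // handoff first: a lock-free reader never finds
                                                   // the data in neither place, only possibly in both
        active.set(new ConcurrentSkipListMap<>()); // publish a fresh, empty memtable
    }

    public static void main(String[] args) {
        active.get().put("a", "1");
        rotate();
        active.get().put("b", "2");
        System.out.println(immutables.peekFirst().containsKey("a")); // true
        System.out.println(active.get().containsKey("b"));           // true
    }
}
```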
Maintaining the Read Visibility Invariant
Since the get() path is lock-free, we must be extremely careful about the order in which we search for data. If we aren't careful, a record could "disappear" for a few microseconds during the swap.
To prevent this, the get() method follows a strict Newest-to-Oldest hierarchy:
- The Active Memtable: We check the AtomicReference first.
- The Immutable Queue: We iterate through immutableMemtables. Because we use addFirst(), the most recently retired memtables are checked first.
- SSTables: Finally, we hit the disk via the ssTableManager.
By the time a Memtable is removed from the immutableMemtables deque, its data is guaranteed to be persisted in an SSTable, maintaining a seamless "hand-off" of data visibility.
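A sketch of this lock-free read hierarchy, again with simplified toy types standing in for the real Memtable and SSTable classes (the real SSTable lookup goes through the ssTableManager rather than an in-memory list):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ConcurrentLinkedDeque;
import java.util.concurrent.ConcurrentSkipListMap;
import java.util.concurrent.atomic.AtomicReference;

// Toy read path: search newest to oldest and return the first hit, so
// a newer value always shadows an older one for the same key.
class ReadPath {
    final AtomicReference<ConcurrentSkipListMap<String, String>> active =
            new AtomicReference<>(new ConcurrentSkipListMap<>());
    final ConcurrentLinkedDeque<ConcurrentSkipListMap<String, String>> immutables =
            new ConcurrentLinkedDeque<>();
    final List<ConcurrentSkipListMap<String, String>> ssTables = new ArrayList<>();

    // Toy stand-in for rotation: retire the active memtable to the queue.
    void retireActive() {
        immutables.addFirst(active.getAndSet(new ConcurrentSkipListMap<>()));
    }

    String get(String key) {
        String v = active.get().get(key);            // 1. active memtable
        if (v != null) return v;
        for (ConcurrentSkipListMap<String, String> imm : immutables) {
            v = imm.get(key);                        // 2. immutables, newest first
            if (v != null) return v;
        }
        for (ConcurrentSkipListMap<String, String> sst : ssTables) {
            v = sst.get(key);                        // 3. finally, "disk"
            if (v != null) return v;
        }
        return null;
    }

    public static void main(String[] args) {
        ReadPath db = new ReadPath();
        db.active.get().put("x", "old");
        db.retireActive();
        db.active.get().put("x", "new"); // newer value shadows the retired one
        System.out.println(db.get("x")); // new
    }
}
```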
Asynchronous Flushing
Flushing is the process of serializing a memtable to disk so it becomes an SSTable. Once the memtable rotation described above is in place and the now-immutable memtables are gathered into a queue, we need a background thread to poll this queue and do the actual flushing.
We do this with a single application-level background thread.
private final ExecutorService flushExecutor;
...
this.flushExecutor = Executors.newSingleThreadExecutor(r -> {
    Thread t = new Thread(r, "lsm-flush");
    t.setDaemon(true);
    return t;
});
A flush job is submitted to this executor as the last step of the memtable rotation, so normally the newly created immutable memtable is immediately flushed and won't stay in the queue for long.
private void rotateMemtable(Memtable<K, V> memtable) {
    ...
    flushExecutor.submit(() -> flushImmutableMemtable(immutable));
}

private void flushImmutableMemtable(ImmutableMemtable<K, V> imm) {
    try {
        log.debug("Flushing immutable memtable with segment: {}", imm.getCommitLogSegmentId());
        ssTableManager.flush(imm);
        imm.deleteCommitLogSegment();
        immutableMemtables.remove(imm);
        log.debug("Successfully flushed and removed immutable memtable");
    } catch (IOException e) {
        log.error("Flush failed for segment: {}", imm.getCommitLogSegmentId(), e);
    }
}
Summary
Making something thread-safe is rarely trivial and adds a lot of complexity, but it is necessary for any real-world application.
The first step was to make reads and writes non-blocking and move all operations involving disk I/O to background threads. In this article I described some core concepts and showed how the memtable rotation and the asynchronous memtable-to-SSTable flush were implemented.
I will have a follow-up article about how the commit log is persisted asynchronously and what further performance tuning has been done.
The updated codebase is available on GitHub.