Building a Database series
Discover how I built an LSM tree key-value store, optimized performance with Bloom filters, and explored efficient file I/O. Follow my journey into compaction, concurrency, transactions, and distributed systems.
Some time ago, I wrote an article about implementing a log-structured merge tree-based key-value store. Since then, I’ve had new ideas and the motivation to continue working on that project.
My original goal was to understand the data structure and algorithms involved, but as I wrote more articles, the focus shifted toward performance improvements and more complex topics. Learning about these optimizations turned out to be just as enjoyable as the initial exploration. I plan to keep improving the project, which means it will become increasingly complex. To help others follow along, I’ll use tags in the repository to mark key points in the commit history corresponding to each article. This way, anyone interested in earlier, simpler stages that match a specific article can easily find them.
I'll use this page as a central hub for this project and update it when new related articles come out.
Topics Covered So Far:
- Building an LSM Tree key-value store
- Building a B-Tree index
- Using generics to make the database more flexible
- Adding a Bloom filter to the LSM tree to improve performance
- Enhancing performance through more efficient file I/O
I've also put together a roadmap of topics I’d like to explore in the future. This isn’t a strict plan—there’s no set order, and I don’t have deadlines. It’s more of a collection of interesting areas to dive into.
Future Exploration:
Data Representation and Storage
- Byte-level storage and serialization – Right now, I store data as strings for readability, but this isn’t efficient. I want to explore better storage formats.
- Data compression – SSTables are often compressed to reduce space usage and improve I/O efficiency.
Compaction Strategies
- Tiered vs. leveled compaction – I implemented a basic compaction algorithm to grasp the concepts, but there’s plenty of room for improvement.
- Compaction scheduling and resource management – When and how should compaction be performed efficiently?
Concurrency and Transactions
- Read/write concurrency control – The database can be used in a multi-threaded environment, so ensuring correctness is crucial. There might also be opportunities to leverage concurrency for parallel tasks.
- Basic transaction support – It’s worth exploring how to implement transaction management while maintaining ACID properties.
Querying and Indexing
- Range queries – Since data is stored in sorted order, range queries can be highly efficient in LSM trees. Implementing this functionality is a logical next step.
- Adding a simple query language – Clients need an interface to interact with the database. Designing a minimal query language could be an interesting challenge.
Client Interaction and Configuration
- Building a simple client library – Most databases aren’t embedded directly but follow a client-server model. Developing this architecture could provide valuable insights.
Advanced Topics
- Distributed LSM trees – Distributed systems are complex but fascinating. I’d love to experiment with adding distributed capabilities and exploring algorithms like the gossip protocol.