(checkers not chess)

store pt. 1

(mechanical sympathy)

So I've been working on a database system (creatively named "store") for a few months now, and I think it's far enough along that it's time to give a general conceptual overview of what the system is and how it will work. A good analogy for the system would be something like a server version of RocksDB, or a single-node (for now) version of something like TiKV. It's designed specifically for oltp workloads, and will most likely be used as a "storage node" in a larger, fully fledged distributed database. In this post, I want to give an overview of the general architecture, and the guiding principles I've had while implementing the system.

The core idea for me here is something called "mechanical sympathy". It's an idea that comes from race car driving, but it's seen some use in the context of software engineering, and I think it captures a lot of valuable ideas in a simple way. With respect to racing, the idea is that the driver only has one mechanism for accomplishing their goal: the car. Yes, conceptually, the idea is to get around the track the fastest, but for the driver specifically, the task is slightly more detailed: get around the track as fast as you can with the car that you have. This means they may need to do some unintuitive, or generally frowned upon, things because those suit the strengths and weaknesses of the vehicle.

The same idea makes a lot of sense when applied to software. Fundamentally, software engineering is about "getting computers to do stuff". And while the goal may be presented in a more esoteric or abstract way as some logical problem to solve, the medium we have to achieve that goal is the machine, and we need to respect the characteristics of our hardware when we implement our systems. With that in mind, let's talk about store!

Log structuring for modern storage devices

One of the first things to do with this kind of system is choose the core indexing structure that maps keys to values. This is kind of the central decision for the "logical" functioning of the system. Fundamentally, all we're really doing is shuttling bytes back and forth between some files and some network connections; everything in between is about organizing the data so that process is efficient and satisfies certain constraints (we'll talk about transactions in a bit). Specifically, we want to support general oltp workloads well, meaning we can't really make any assumptions about how read- vs write-heavy our workloads will be. We'd also like range scans to work reasonably well. We can assume that we won't be dealing with huge range scans (this isn't an analytics system), but small range scans are definitely crucial for oltp workloads.
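
To make that concrete, here's a rough sketch of the kind of key-value interface I'm describing: point reads, writes, and small ordered range scans. The trait and the toy in-memory stand-in are purely illustrative; the names and types aren't store's actual API.

```rust
use std::collections::BTreeMap;
use std::ops::Range;

type Key = Vec<u8>;
type Value = Vec<u8>;

/// The logical interface: point reads, writes, and small range scans.
trait Store {
    fn get(&self, key: &[u8]) -> Option<Value>;
    fn put(&mut self, key: Key, value: Value);
    fn scan(&self, range: Range<Key>) -> Vec<(Key, Value)>;
}

/// A toy in-memory stand-in, just to pin down the semantics.
struct MemStore {
    map: BTreeMap<Key, Value>,
}

impl Store for MemStore {
    fn get(&self, key: &[u8]) -> Option<Value> {
        self.map.get(key).cloned()
    }
    fn put(&mut self, key: Key, value: Value) {
        self.map.insert(key, value);
    }
    fn scan(&self, range: Range<Key>) -> Vec<(Key, Value)> {
        self.map
            .range(range)
            .map(|(k, v)| (k.clone(), v.clone()))
            .collect()
    }
}

fn main() {
    let mut s = MemStore { map: BTreeMap::new() };
    s.put(b"a".to_vec(), b"1".to_vec());
    s.put(b"b".to_vec(), b"2".to_vec());
    // Ordered range scan over [a, c).
    println!("{:?}", s.scan(b"a".to_vec()..b"c".to_vec()));
}
```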

Generally speaking, there are two major options in this space: b-trees and LSM trees. Based on the constraints I just laid out, you might think LSM trees wouldn't be the right choice. After all, LSM trees are for write-heavy workloads, right? While they certainly excel in those settings, they've been gaining in popularity recently, and I think the reason for this is slightly more nuanced. Specifically, there's been a fundamental change in how our storage devices work. Disk IO is probably the most crucial thing to optimize for in a database, and "traditional" databases (sqlite, postgres, etc) generally use b-trees because they are uniquely suited to both the logical operations of a database and the performance constraints of spinning disk drives. With HDDs, the real performance killer is disk seeks: having a physical arm that needs to move makes random access substantially slower than sequential access, but reads vs writes don't have a huge differential in performance. B-trees, with leaf pages laid out sequentially, give systems much more opportunity to do sequential io and mitigate the cost of seeks.

With modern SSDs, though, the story is different. There's no more physical movement happening, and in general, performance for all kinds of access patterns will be much faster than on an HDD. That being said, they do still have some idiosyncrasies, and the nature of their performance imbalance is quite different. Specifically, there isn't a huge gap between random and sequential access, but writes can be (relatively) much slower than reads. This is because the flash cells on SSDs can "wear out" over time. Conceptually, you can think of a flash cell as having some upper limit on the number of times you can write to it before it stops working. To mitigate this, SSDs typically do a lot of extra internal bookkeeping to spread this wear and tear across different flash cells, in an effort to make the device last longer. The result is that all of this extra work the device has to do for a write gives any individual write a much higher overhead. So we want to support a general workload, without making any assumptions about read vs write heaviness, but we have a storage medium that performs much worse for writes than for reads. We need some way to mitigate this imbalance. LSM trees, and log structuring in general, are a great way to do this. By batching up our writes into large sequential chunks, we can amortize the cost of physically writing across many individual write operations.
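
As a concrete illustration of that batching idea, here's a minimal sketch of a buffered log writer: small logical writes accumulate in memory and get flushed to the file as one large sequential append. The framing and flush threshold here are made up for the example, not store's actual log format.

```rust
use std::fs::{File, OpenOptions};
use std::io::{self, Write};

/// Amortize SSD write cost by buffering small writes in memory and
/// flushing them to the log as one large sequential append.
struct LogWriter {
    file: File,
    buf: Vec<u8>,
    flush_threshold: usize,
}

impl LogWriter {
    fn open(path: &str) -> io::Result<Self> {
        let file = OpenOptions::new().create(true).append(true).open(path)?;
        Ok(Self { file, buf: Vec::new(), flush_threshold: 1 << 20 }) // ~1 MiB batches
    }

    /// Append one record to the in-memory batch (length-prefixed framing).
    fn append(&mut self, record: &[u8]) -> io::Result<()> {
        self.buf.extend_from_slice(&(record.len() as u32).to_le_bytes());
        self.buf.extend_from_slice(record);
        if self.buf.len() >= self.flush_threshold {
            self.flush()?;
        }
        Ok(())
    }

    /// Write the whole batch with a single sequential write, then clear it.
    fn flush(&mut self) -> io::Result<()> {
        self.file.write_all(&self.buf)?;
        self.buf.clear();
        Ok(())
    }
}

fn main() -> io::Result<()> {
    let mut log = LogWriter::open("store.log")?;
    log.append(b"hello")?;
    log.flush()?; // force the tail of the batch out at shutdown
    Ok(())
}
```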

That being said, LSM trees were originally designed for write-heavy workloads, and while their architecture generally mitigates the issues with SSDs well, they make a lot of other trade-offs that might not make sense in our case. Specifically, reads and range scans can suffer quite a bit. For reads, the data in LSM trees is organized first by recency, not by key. This means that if we issue a read for a key that was written quite a while ago, we may have to read many files (and thus issue many read ios). Range scans suffer in a similar way: keys in the same range can be scattered across many files. All of these extra reads can clog up our io bandwidth and add significant latency to both point reads and range scans (people usually call this "read amplification"). Some read amplification is not the end of the world when dealing with SSDs (after all, random read performance is quite good), but my intuition is that LSM trees go a bit too far here. While there are ways to mitigate these issues (bloom filters, organizing individual files by key, etc), to me these feel more like stopgaps than a fundamentally sound architecture for the workload we're trying to support. Ideally, we could keep the logical structure of a b-tree (and all of the benefits that come with that), and still have an io pattern that pairs well with the characteristics of modern storage devices.

Lucky for us, there is some work being done in this direction. Specifically, the bw-tree, out of Microsoft Research, is more or less a "log structured b-tree". The design in the paper is focused more on building a lock-free concurrent b-tree that just happens to use log structuring to facilitate this. I have a different approach to how I want to handle concurrency (more on that in a bit), so we can ignore all of the concurrency components for now and just focus on the general structure. Essentially, the data is still logically laid out as a b-tree, but when we write to a page, rather than doing a write in place, we prepend a "delta" to the page that describes the write. This allows us to batch up our write operations as a sequence of deltas, giving us the batched write io patterns that we wanted from the LSM tree. Eventually, a page will fill up, and at that point, we compact it into a "base page", which looks more or less like a typical b-tree page, but with a little less bookkeeping data, since we don't need to do in-place updates. As a brief aside, having compaction be granular to pages like this should give us some big wins, and make our implementation much more elegant. A major piece of the work that needs to be done in LSM tree systems is compaction, and it tends to be a very expensive process that needs a lot of careful consideration with respect to the amount of resources being spent on it. This page-based compaction should allow us to naturally spread out the compaction work in a fairly elegant way, without needing the complexity of something like a scheduler.
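
Here's a tiny sketch of that page layout, just to make the shape of it concrete: reads walk the delta chain before falling back to the base page, writes just prepend deltas, and compaction folds the chain back into a fresh base page. The structures are simplified stand-ins, not store's actual in-memory or on-disk format.

```rust
type Key = Vec<u8>;
type Value = Vec<u8>;

/// A compacted, immutable "base page": sorted key/value pairs, b-tree style.
struct BasePage {
    entries: Vec<(Key, Value)>,
}

/// A single logical update, prepended instead of written in place.
enum Delta {
    Put(Key, Value),
    Delete(Key),
}

/// A logical page = a chain of recent deltas (newest first) + a base page.
struct Page {
    deltas: Vec<Delta>,
    base: BasePage,
}

impl Page {
    /// Reads walk the delta chain first, then fall back to the base page.
    fn get(&self, key: &[u8]) -> Option<&Value> {
        for delta in &self.deltas {
            match delta {
                Delta::Put(k, v) if k.as_slice() == key => return Some(v),
                Delta::Delete(k) if k.as_slice() == key => return None,
                _ => {}
            }
        }
        self.base
            .entries
            .binary_search_by(|(k, _)| k.as_slice().cmp(key))
            .ok()
            .map(|i| &self.base.entries[i].1)
    }

    /// Writes just prepend a delta; the base page is never updated in place.
    fn put(&mut self, key: Key, value: Value) {
        self.deltas.insert(0, Delta::Put(key, value));
    }

    /// Compaction: fold the delta chain into a fresh base page.
    fn compact(&mut self) {
        use std::collections::BTreeMap;
        let mut map: BTreeMap<Key, Value> = self.base.entries.drain(..).collect();
        // Apply deltas oldest-to-newest so newer writes win.
        for delta in self.deltas.drain(..).rev() {
            match delta {
                Delta::Put(k, v) => { map.insert(k, v); }
                Delta::Delete(k) => { map.remove(&k); }
            }
        }
        self.base.entries = map.into_iter().collect();
    }
}

fn main() {
    let mut page = Page { deltas: Vec::new(), base: BasePage { entries: vec![] } };
    page.put(b"k".to_vec(), b"v1".to_vec());
    page.put(b"k".to_vec(), b"v2".to_vec());
    assert_eq!(page.get(b"k"), Some(&b"v2".to_vec())); // newest delta wins
    page.compact();
    assert_eq!(page.get(b"k"), Some(&b"v2".to_vec())); // same answer from the base page
}
```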

Respect the almighty cache

So what about concurrency? I mentioned that we won't be using the lock-free techniques from the bw-tree paper, but obviously we still want to make use of all of our cpu cores. In general, the best way to deal with concurrency efficiently is to keep contention as low as possible. For most concurrent systems, the fundamental challenge is coordinating access to shared data, and the less data you share among multiple cores, the more efficient and simple that process can be. If we take this idea to the extreme and find a way to simply partition all of our resources, we can essentially eliminate coordination entirely by not actually having any shared resources. That's the plan for this system. In general, this approach is usually called "thread per core", but I think that's a pretty bad name, because most concurrent systems will spawn one thread per core (tokio, go's runtime, etc), as there is minimal benefit to having more threads than cores (the small benefit would be simplicity or implementation speed, since spawning more threads than cores essentially just means you've delegated the task of scheduling concurrent jobs to the operating system). So for distinction, I'm going to refer to this approach as "shard per core" from now on, indicating that resources are being partitioned across cores.

So we need a fast way to "spread out" both work and data amongst our cores, and we want this partitioning to be nice and even: we don't want one core doing a ton more work, or holding a lot more data, than another. Partitioning the data in our case is really simple: we can just partition by the hash of the key! If we choose a good hash function, finding the right core for a piece of data is fast, and should give us a pretty even spread of data across our shards. It's also pretty easy to make this process deterministic, which should help a lot with testing/debugging. Partitioning work is slightly more difficult, but not impossible. One reasonable approximation for the amount of "work" a shard has in this case is just how many connections it's handling. Obviously this isn't a perfect metric, some connections will be querying more than others, but for oltp workloads like the ones we're targeting, they should be pretty close. Now the task is just to load balance connections across cores. This is a well studied problem, and there are a lot of options at our disposal. The specific load balancing algorithm isn't crucial to discuss here, but mechanically, we will have an authoritative "main thread" that runs on its own core and is responsible for this load balancing. Of course, sometimes we'll also need some amount of coordination across shards; the main thread will play a big role in this, but we'll also be using message passing with fifo queues between cores. Both of these pieces make a lot more sense in the context of transactions though, so I'll talk about them more in that section.
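
The data-partitioning half of this is small enough to sketch outright. Here's an illustrative key-to-shard routing function using an inline FNV-1a hash; the real system might pick a different hash, but the properties that matter are the same: fast, deterministic, and evenly spread.

```rust
/// Simple FNV-1a hash over the key bytes (stand-in for whatever hash the
/// system actually uses; any fast, deterministic hash with good spread works).
fn fnv1a(bytes: &[u8]) -> u64 {
    let mut hash: u64 = 0xcbf29ce484222325;
    for &b in bytes {
        hash ^= b as u64;
        hash = hash.wrapping_mul(0x100000001b3);
    }
    hash
}

/// Deterministic key -> shard mapping: the same key always lands on the same core.
fn shard_for_key(key: &[u8], num_shards: usize) -> usize {
    (fnv1a(key) % num_shards as u64) as usize
}

fn main() {
    let num_shards = 8;
    println!("key 'user:42' -> shard {}", shard_for_key(b"user:42", num_shards));
}
```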

So now that we understand how the concurrency will work, how do we make the individual shards perform really well? One of the key benefits we get from partitioning data this way is that we keep our cores' cpu caches nice and hot. Concurrency synchronization in particular can absolutely wreck cache performance: writing to a shared section of memory invalidates the corresponding cache lines on other cores, and gives the cpu a lot of extra coordination work to do. We avoid that entirely in this setup. We can take this cpu cache benefit a step further as well, with a general approach to building the system that I like to call "pipelines, not methods". In general, a lot of software is built around the idea of executing logic on single pieces of data (read: methods on a type). For systems like this that want a lot of throughput, the mechanism for achieving that usually looks like executing many small flows of logic on individual pieces of data, simultaneously (this is essentially what go's runtime does). The approach I want to take with this system is to basically flip that mindset. Rather than lots of logical flows operating on individual pieces of data, we want a single large "pipeline" that encodes the logic of the system and processes a large batch of data at once. This should give us even more cache locality benefits, and provide a ton of throughput.
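
A toy example of the mindset shift: rather than one function that carries a single request from decode through lookup to response, each stage runs over a whole batch before the next stage starts, so the code and data for each stage stay resident in cache. The stage names and types below are made up purely for illustration.

```rust
use std::collections::BTreeMap;

struct Request { key: Vec<u8> }
struct Lookup { key: Vec<u8>, value: Option<Vec<u8>> }

/// Stage 1: decode the whole batch of raw requests.
fn decode_stage(raw: Vec<Vec<u8>>) -> Vec<Request> {
    raw.into_iter().map(|bytes| Request { key: bytes }).collect()
}

/// Stage 2: do all of the lookups in one pass over the batch.
fn lookup_stage(reqs: Vec<Request>, store: &BTreeMap<Vec<u8>, Vec<u8>>) -> Vec<Lookup> {
    reqs.into_iter()
        .map(|r| {
            let value = store.get(&r.key).cloned();
            Lookup { key: r.key, value }
        })
        .collect()
}

/// Stage 3: build all of the responses.
fn respond_stage(lookups: Vec<Lookup>) -> Vec<Vec<u8>> {
    lookups
        .into_iter()
        .map(|l| l.value.unwrap_or_else(|| b"<not found>".to_vec()))
        .collect()
}

fn main() {
    let mut store = BTreeMap::new();
    store.insert(b"a".to_vec(), b"1".to_vec());
    // One pass per stage over the whole batch, rather than one full pass per request.
    let batch: Vec<Vec<u8>> = vec![b"a".to_vec(), b"b".to_vec()];
    let responses = respond_stage(lookup_stage(decode_stage(batch), &store));
    println!("{} responses", responses.len());
}
```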

The elephant in the room here, of course, is handling io with this setup. If we don't have a good io mechanism, we can very easily bury all of the cache benefits we got from the pipelined approach and waste a ton of time blocking on io. Ideally, we want to do our io asynchronously, or more specifically, in a non-blocking way: we want to be able to issue io, and do more work while we're waiting for it to finish. io_uring is the perfect fit here. We can submit io for both file access and network connections without blocking, and make progress on other work while we wait for that io to complete. Additionally, it should fit very well with our batched pipeline approach, and dramatically reduce the number of syscalls in our system. I won't go into a ton of other detail about io_uring here, since plenty of people are already talking about it. Needless to say, it fits really well with the rest of the system, and in my opinion, is kind of a no brainer for greenfield io-heavy projects.
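
For a flavor of what this looks like, here's a stripped-down sketch of a single non-blocking file read using the io-uring crate in Rust. I'm assuming that crate's API here, and in the real system we'd queue a whole batch of operations (file and network) before submitting, then harvest completions as part of the pipeline instead of blocking on one.

```rust
use io_uring::{opcode, types, IoUring};
use std::os::unix::io::AsRawFd;

fn main() -> std::io::Result<()> {
    let mut ring = IoUring::new(256)?; // queue depth of 256 submissions
    let file = std::fs::File::open("some-page-file")?;
    let mut buf = vec![0u8; 4096];

    // Describe the read; nothing is executed yet.
    let read_op =
        opcode::Read::new(types::Fd(file.as_raw_fd()), buf.as_mut_ptr(), buf.len() as _)
            .build()
            .user_data(42); // tag so we can match the completion later

    // Queue it up. In the real pipeline we'd push a whole batch before submitting.
    unsafe { ring.submission().push(&read_op).expect("submission queue full") };

    // One syscall submits everything queued; we could go do other work here
    // instead of waiting, but for the sketch we just block on one completion.
    ring.submit_and_wait(1)?;

    let cqe = ring.completion().next().expect("completion queue entry");
    println!("read {} bytes for op {}", cqe.result(), cqe.user_data());
    Ok(())
}
```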

Transactions: sometimes you want a single thread

Finally, it's time to talk about transactions. Transactions are the fundamental "unit of work" in any oltp system, and really, they're kind of an incredible thing to support. If we take a step back, our interface is essentially saying that users can execute any arbitrary set of reads and writes, and that set of work will be made durable, and won't interfere or conflict with any of the other transactions (of which there are N) running concurrently in our system. That's a very strong contract to support, and it's not easy to do well! Transactions have degrees of "hardness" though, and a lot of users don't need a system with the strongest guarantees. In general, this trade-off between strong guarantees and efficiency is thought of through the lens of isolation levels. Traditionally, isolation levels have existed on a sliding scale in this space: you can give up some guarantees and gain performance, or vice versa. A recent trend, and one that fits really well with our log structured approach, is to support something called snapshot isolation. Snapshot isolation essentially says that each transaction sees a private "snapshot" of the data in the system, and it's generally implemented by allowing multiple versions of data to exist at once (usually called MVCC). Lucky for us, we're already keeping multiple versions of data through our log structured approach, so this fits really nicely. Snapshot isolation breaks the traditional "sliding scale" dynamic of isolation levels by offering one of the most performant implementations of transactions while providing all but one of the guarantees you might want from a transaction (the conflict it doesn't handle is called write skew, and a lot of workloads simply don't produce this kind of conflict).
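
Here's a small sketch of the MVCC read path this implies: each key keeps a chain of versions tagged with commit timestamps, and a transaction reads the newest version that committed at or before its snapshot timestamp. The structures are simplified for illustration and aren't store's actual layout.

```rust
struct Version {
    commit_ts: u64,
    value: Vec<u8>,
}

/// All versions of one key, newest first.
struct VersionChain {
    versions: Vec<Version>,
}

impl VersionChain {
    /// What a transaction with snapshot timestamp `snapshot_ts` sees.
    fn read_at(&self, snapshot_ts: u64) -> Option<&Vec<u8>> {
        self.versions
            .iter()
            .find(|v| v.commit_ts <= snapshot_ts)
            .map(|v| &v.value)
    }
}

fn main() {
    let chain = VersionChain {
        versions: vec![
            Version { commit_ts: 30, value: b"newest".to_vec() },
            Version { commit_ts: 10, value: b"older".to_vec() },
        ],
    };
    // A transaction that took its snapshot at ts=20 doesn't see the ts=30 write.
    assert_eq!(chain.read_at(20), Some(&b"older".to_vec()));
}
```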

One of the tricky things about transactions is that they kind of want the appearance of running sequentially on a single thread, but you need an extreme amount of concurrency to get even reasonable performance out of a database system like this. Amazingly, a great way to make things seem like they're running on a single thread is to...run them on a single thread 🤯. Ok, before you get the pitchforks out, let me explain. A lot of concurrency control is an elaborate dance to coordinate the subset of work that needs to appear sequential; what I'm proposing is to extract that subset of work and run it on a single authoritative "main thread". For example, managing the WAL and checking for write conflicts are a super natural fit here. Shards can collect writes for transactions locally, and issue reads to other shards via message passing; then, at commit time, we send the transaction to the main thread, which will in turn check the writes for any conflicts and, if successful, commit the transaction and issue those writes to their respective shards. In general, having a single authoritative thread will be a huge help in a lot of other areas too, and should allow us to simplify a lot of things (for example, the connection load balancing I mentioned). Of course, we have to be careful that this thread doesn't become a bottleneck, and we'll need extensive profiling for this, but my intuition is that it should work quite well (I have a feeling the main thread may even end up underutilized).
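
To make the commit path concrete, here's a hypothetical sketch of the conflict check the main thread could run, along the lines of first-committer-wins: if any key in the transaction's write set has a committed write newer than the transaction's snapshot, it aborts; otherwise it gets a commit timestamp (and this is also where the WAL append and the fan-out of writes back to the owning shards would happen). All of the names and structures here are illustrative.

```rust
use std::collections::HashMap;

struct Txn {
    snapshot_ts: u64,
    write_set: Vec<(Vec<u8>, Vec<u8>)>, // (key, new value) collected by the shards
}

struct MainThread {
    next_ts: u64,
    last_commit_ts: HashMap<Vec<u8>, u64>, // key -> ts of latest committed write
}

impl MainThread {
    /// Returns the commit timestamp on success, or None if the txn must abort.
    fn try_commit(&mut self, txn: &Txn) -> Option<u64> {
        // Conflict check: did anyone commit to one of our keys after our snapshot?
        for (key, _) in &txn.write_set {
            if let Some(&ts) = self.last_commit_ts.get(key) {
                if ts > txn.snapshot_ts {
                    return None; // write-write conflict, abort
                }
            }
        }
        // No conflicts: assign a commit timestamp and record the writes.
        // (The WAL append and the messages back to the owning shards go here.)
        self.next_ts += 1;
        let commit_ts = self.next_ts;
        for (key, _) in &txn.write_set {
            self.last_commit_ts.insert(key.clone(), commit_ts);
        }
        Some(commit_ts)
    }
}

fn main() {
    let mut main_thread = MainThread { next_ts: 0, last_commit_ts: HashMap::new() };
    let txn = Txn { snapshot_ts: 0, write_set: vec![(b"k".to_vec(), b"v".to_vec())] };
    println!("commit ts: {:?}", main_thread.try_commit(&txn));
}
```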

That's all I'll go into for now. I've really enjoyed applying this "mechanical sympathy" philosophy to building this system. It's generally led me to performance-first solutions that are also simple and easy to understand. Perhaps the most enjoyable piece is how well all of these decisions have reinforced each other. To me, that's a good sign that the general principles here are the right ones, and it gives me a lot of confidence that I'll be able to tackle the difficult implementation challenges with the right mindset. Feel free to check out the repo here, and stay tuned for more posts!