Kyle Edwards

Abstractions of Database Design

At any point in time, the state of a database can be thought of as the accumulation of all committed state changes applied in order up to the last. Depending on how a database is implemented, whether various data structures live in memory or on disk, a database may or may not be able to guarantee durability. If the power goes out before it gets a chance to write changes to the disk, those changes are lost.

It’s common for databases to write inserts and updates to a write-ahead log. If we consider the state of a distributed system, then write-ahead logging is effectively equivalent to the event sourcing architecture pattern.

Above the WAL, we commonly see an in-memory index. For databases like Google’s Bigtable or Apache’s Cassandra, this takes the form of an in-memory tree structure (AVL tree, red-black tree, etc…) called a memtable.

As writes are persisted to the write-ahead log, they are also added to the memtable. Periodically these memtables are flushed and written to disk in the form of an SSTable (Sorted Strings Table). Each SSTable holds a sorted disk representation of the data in the memtable. This means that multiple SSTables may have the same key, but reads can be performed in reverse (WAL (maybe), memtable, SSTablen, SSTablen-1, …, SSTable1 until the key is found.

Obviously this would mean that keys that don’t exist in the database perform a great deal of work for nothing. Bloom Filters are used to optimize reads of some keys that don’t exist in the database (not all keys are optimized due to the properties of Bloom Filters).

On the relational side, B-trees and B+-trees are essentially multi-level indexes into a table by a specific key. They are structured on-disk to be local for quick retrieval.

Some abstractions can be inferred here:

*I’m making the assumption here that equivalence does not mean congruence (events may appear in different orders if they are guaranteed to not modify the same state; see logical clocks, CDRTs, etc…).